Agent Evaluation Survey

Chapter 0: The Problem

You build an LLM agent. It can browse the web, write code, call APIs, and reason across multiple steps. It even reflects on its mistakes and retries. It feels impressive in a demo.

But how do you know it works?

Traditional LLM benchmarks (MMLU, GSM8K, HumanEval) test single-turn, text-in, text-out performance. An agent is fundamentally different. It operates in a loop: it observes the environment, decides on an action, executes it, observes the result, and repeats. Each step can succeed or fail. The environment changes between steps. The agent might need to call tools, navigate websites, write and debug code, or hold multi-turn conversations.

The core tension: Evaluating an agent is not like evaluating a model. A model produces one answer. An agent produces a trajectory — a sequence of observations, decisions, and actions. Two agents might reach the same correct answer via completely different paths. One might be efficient, the other wasteful. One might be safe, the other dangerous. Measuring just "did it get the right answer?" misses almost everything that matters.
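To make the trajectory idea concrete, here is a minimal Python sketch. The `Step` and `Trajectory` types and the two example runs are hypothetical, invented for illustration; they do not come from the survey.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    observation: str   # what the agent saw before acting
    action: str        # what it decided to do
    succeeded: bool    # whether the action worked


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""

    def summary(self, expected: str) -> dict:
        """Report the outcome together with how the agent got there."""
        failed = sum(1 for s in self.steps if not s.succeeded)
        return {
            "correct": self.final_answer == expected,
            "num_steps": len(self.steps),
            "failed_steps": failed,
        }


# Two hypothetical agents that reach the same answer by different paths.
efficient = Trajectory(
    steps=[Step("question", "search", True), Step("results", "answer", True)],
    final_answer="42",
)
wasteful = Trajectory(
    steps=[
        Step("question", "search", False),
        Step("error", "search", False),
        Step("error", "search", True),
        Step("results", "answer", True),
    ],
    final_answer="42",
)

a = efficient.summary(expected="42")
b = wasteful.summary(expected="42")
print(a)
print(b)
```

Both summaries report `correct: True`, so a final-answer metric cannot tell the two agents apart. Only the step counts do.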

This survey by Yehudai et al. (2025) is the first to comprehensively map how the field evaluates LLM-based agents. They organize the landscape along four dimensions:

Fundamental capabilities — planning, tool use, self-reflection, memory
Application-specific benchmarks — web agents, SWE agents, scientific agents, conversational agents
Generalist agent benchmarks — GAIA, AgentBench, TheAgentCompany
Evaluation frameworks — LangSmith, Vertex AI, Galileo, and the infrastructure for building evals

Along the way, they identify critical gaps: almost nobody measures cost-efficiency, safety, or robustness. Most benchmarks use coarse "did it succeed?" metrics that hide where agents actually fail. And static benchmarks get saturated quickly as models improve.

Agent vs. Model Evaluation

A model takes one input and produces one output. An agent loops through observe-decide-act cycles. Each step changes the environment state, and this is what makes agent evaluation fundamentally harder.


Concept: An agent evaluation must assess not just final outcomes but the entire trajectory — each decision, each tool call, each recovery from error. Realization: This is why you cannot simply reuse LLM benchmarks for agents. You need environments that simulate real-world dynamics, metrics that capture intermediate steps, and frameworks that track cost and safety alongside accuracy.

Why can't traditional LLM benchmarks (like MMLU or GSM8K) adequately evaluate an LLM-based agent?

Because agents operate in multi-step loops with dynamic environments, tool calls, and trajectories; traditional benchmarks only test single-turn text-to-text responses and miss the sequential, interactive nature of agent behavior
Because agents are always more accurate than models
Because traditional benchmarks are too easy for agents

Chapter 1: The Taxonomy

Before we can evaluate agents, we need a map. The survey organizes the entire evaluation landscape into four dimensions. Think of these as concentric circles: from atomic skills outward to complete systems.

Layer 1: Fundamental Capabilities

Can the agent plan, use tools, reflect on mistakes, and maintain memory? These are the atomic building blocks. You test them in isolation: give the agent a planning problem, see if it decomposes it correctly. Give it an API schema, see if it calls the right function.

↓

Layer 2: Application-Specific Benchmarks

Can the agent do a real job? Web navigation, software engineering, scientific research, customer service. Each domain combines capabilities differently and adds domain-specific challenges (e.g., a web agent must parse HTML; a SWE agent must understand codebases).

↓

Layer 3: Generalist Benchmarks

Can the agent handle diverse tasks it has never seen before? GAIA asks 466 real-world questions requiring web search, coding, and multi-modal reasoning together. AgentBench spans OS commands, databases, games, and household tasks. TheAgentCompany simulates an entire software company.

↓

Layer 4: Evaluation Frameworks

How do developers actually build, run, and analyze evals? LangSmith, Vertex AI, Galileo provide monitoring, trajectory analysis, stepwise assessment, and LLM-as-judge pipelines. These are the infrastructure that makes evaluation repeatable and scalable.

Concept: The four layers mirror how agents are built: capabilities compose into applications, applications compose into general intelligence, and frameworks make evaluation systematic. Realization: Most teams only evaluate at layer 2 or 3 (application or generalist). But failures often originate at layer 1 (a broken tool call, a flawed plan). Without capability-level evaluation, you see that the agent failed but not why.

Each layer also has its own evaluation paradigm:

Layer	What you measure	How you measure it	Example benchmarks
Capabilities	Can it plan? Use tools? Reflect?	Isolated tasks, controlled inputs	PlanBench, BFCL, Reflection-Bench
Applications	Can it do this specific job?	Simulated or real environments	WebArena, SWE-Bench, tau-Bench
Generalist	Can it handle anything?	Diverse, multi-skill test suites	GAIA, AgentBench, TheAgentCompany
Frameworks	How well does the eval pipeline work?	Dev tools, monitoring, A/B testing	LangSmith, Galileo, Vertex AI

Evaluation Taxonomy Map

The four layers of agent evaluation, from atomic capabilities to evaluation infrastructure, each with its key benchmarks and what it measures.

An agent correctly answers a complex question by calling five APIs in sequence, but it takes 47 API calls to get there (42 failed attempts). Which evaluation layer would catch this inefficiency?

Layer 1: Fundamental Capabilities
Layer 3: Generalist Benchmarks
Layer 4: Evaluation Frameworks. Trajectory-level analysis and cost metrics would reveal the 42 wasted calls, while a pass/fail benchmark at layers 2 or 3 would only report "success"

Chapter 2: Planning & Reasoning

Planning is the backbone of agency. Without it, an agent is just a fancy autocomplete that happens to call tools. Planning means decomposing a complex task into subtasks, tracking which subtasks are done, maintaining a belief about the current state, and adapting when things go wrong.

The survey identifies five essential planning abilities:

Ability	What it means	Example benchmark
Task decomposition	Breaking "book a trip" into "find flights" + "find hotels" + "check budget"	Natural Plan
State tracking	Knowing what has been done and what remains	PlanBench, ALFWorld
Self-correction	Detecting errors and recovering — not just continuing blindly	MINT, ToolEmu
Causal understanding	Predicting what an action will do before executing it	ACPBench, FlowBench
Meta-planning	Choosing and refining the planning strategy itself	AutoPlanBench

Concept: Current LLMs are decent at short-term tactical planning (2-5 steps) but struggle dramatically with strategic long-horizon planning (10+ steps). Realization: AutoPlanBench shows that even state-of-the-art LLM agents lag behind classical symbolic planners on problems requiring 10+ steps. The gap isn't in reasoning per se — it's in maintaining consistent state over many steps and recovering from cascading errors.

The benchmarks evolved in waves. First came repurposed reasoning datasets — GSM8K, HotpotQA, MATH — used as proxies for multi-step reasoning. These test the reasoning engine but not the agent loop. Then came agent-specific benchmarks:

MINT (Wang et al., 2023) — evaluates planning in interactive environments. Even GPT-4 struggles beyond 5-step tasks.
PlanBench (Valmeekam et al., 2023) — systematically tests planning across domains. Models excel at tactical planning but fail at strategic.
FlowBench (Xiao et al., 2024) — tests workflow planning in expertise-intensive tasks like medical protocols.
Natural Plan (Zheng et al., 2024) — real-world planning with natural language. Performance drops sharply as complexity increases.
ACPBench (Kokel et al., 2024) — targets core reasoning skills: goal recognition, action feasibility, state prediction.

Planning Horizon vs. Success Rate

As the number of required planning steps increases, agent success rates drop dramatically. The curve is based on patterns reported across PlanBench, MINT, and Natural Plan.


The key finding: Planning evaluation has progressed from testing whether an agent can reason through math problems (a proxy) to testing whether it can decompose, track, correct, and adapt across real interactive environments. But the hardest problems — long-horizon, open-ended, real-world planning — remain largely unsolved. The best agents score below 30% on benchmarks like Natural Plan with high-complexity tasks.
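One simple way to see why long horizons hurt is a compounding-error model. This is an illustrative assumption, not the survey's analysis: suppose every step succeeds independently with the same probability, and a single failed step sinks the whole task. The 0.9 per-step rate below is a made-up number.

```python
def success_rate(per_step_success: float, horizon: int) -> float:
    """Probability that all `horizon` steps succeed, assuming independence."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return per_step_success ** horizon


for n in (2, 5, 10, 20):
    print(f"{n:>2} steps: {success_rate(0.9, n):.2f}")
```

Under these assumptions, a 90% per-step rate falls to roughly 35% at 10 steps. Real agents can sometimes recover from a failed step, so the model is pessimistic, but it shows why small per-step error rates become large task-level failure rates.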

According to the survey, what is the biggest gap between current LLM agents and classical symbolic planners?

LLM agents handle short tactical plans well but fail at long-horizon strategic planning (10+ steps): they lose track of state, fail to recover from errors, and cannot maintain consistent plans over many steps, where symbolic planners still dominate
Classical planners are faster
LLM agents cannot handle any planning tasks

Chapter 3: Tool Use

An agent that cannot use tools is just an LLM with a loop. Tool use — calling APIs, executing functions, interacting with databases — is what transforms language models into agents that act in the world.

The survey decomposes tool use into a pipeline of sub-tasks:

1. Intent Recognition

Does the user's request require a function call? "What's the weather in Tokyo?" → yes. "Tell me a joke" → probably not.

↓

2. Function Selection

Which tool is appropriate? If the user asks about weather, choose the weather API, not the calculator.

↓

3. Parameter Extraction

Map conversation context to function parameters. "Tokyo" → location="Tokyo". Some parameters may be implicit.

↓

4. Execution

Call the function. Handle errors. Parse the output.

↓

5. Response Generation

Incorporate the tool's output into a natural language response for the user.
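Stages 2 and 3 of the pipeline above lend themselves to a simple stepwise check. The sketch below is an assumption-laden illustration: the `ToolCall` type, the `get_weather` and `calculator` names, and the exact-match scoring are all hypothetical, and real benchmarks such as BFCL define their own matching rules.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


def score_call(predicted: ToolCall, expected: ToolCall) -> dict:
    """Score function selection and parameter extraction separately."""
    right_function = predicted.name == expected.name
    # Parameters only count when the right function was chosen.
    right_params = right_function and predicted.arguments == expected.arguments
    return {
        "function_selection": right_function,
        "parameter_extraction": right_params,
    }


expected = ToolCall("get_weather", {"location": "Tokyo"})

good = score_call(ToolCall("get_weather", {"location": "Tokyo"}), expected)
wrong_param = score_call(ToolCall("get_weather", {"location": "Kyoto"}), expected)
wrong_tool = score_call(ToolCall("calculator", {"expression": "Tokyo"}), expected)
```

Scoring the two stages separately tells you whether a failure came from picking the wrong tool or from filling in the wrong arguments, which a single pass/fail flag would hide.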

Benchmarks for tool use evolved in three generations:

Generation 1: Simple, single-step. ToolAlpaca, APIBench, and BFCL v1 test one-shot function calls with explicitly provided parameters. These are necessary baselines but miss real-world complexity.

Generation 2: Multi-step and stateful. ToolSandbox adds stateful tool execution where one call changes the state for the next. Seal-Tools tests nested tool calls where outputs feed into subsequent calls. NESTFUL evaluates sequences where API outputs chain together.

Generation 3: Realistic and dynamic. ComplexFuncBench tests implicit parameter inference, user-defined constraints, and long-context processing. BFCL v3 incorporates multi-turn, multi-step evaluation with live updates. StableToolBench introduces virtual API servers with caching to handle API instability.

Concept: Tool use evaluation has shifted from "can you fill in function parameters?" to "can you chain tools across a stateful environment where each call changes the world?" Realization: This mirrors the difference between calling a function in isolation vs. writing a real program. The hardest tool use problems involve implicit parameters (not stated in the prompt), dependent outputs (tool B needs tool A's output), and error recovery (tool C fails, now what?).
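A toy example shows what "stateful" and "dependent outputs" mean in practice. The `FlightSandbox` class, its two tools, and the flight data are hypothetical; they are not taken from ToolSandbox or NESTFUL.

```python
class FlightSandbox:
    """A toy stateful environment: booking depends on a prior search."""

    def __init__(self) -> None:
        self.flights = {"NH101": {"destination": "Tokyo", "seats": 1}}
        self.bookings: list[str] = []

    def search_flights(self, destination: str) -> list[str]:
        """Tool A: return ids of flights to `destination` with seats left."""
        return [
            flight_id
            for flight_id, flight in self.flights.items()
            if flight["destination"] == destination and flight["seats"] > 0
        ]

    def book_flight(self, flight_id: str) -> bool:
        """Tool B: needs an id from tool A and changes the environment."""
        flight = self.flights.get(flight_id)
        if flight is None or flight["seats"] == 0:
            return False
        flight["seats"] -= 1
        self.bookings.append(flight_id)
        return True


env = FlightSandbox()
found = env.search_flights("Tokyo")    # tool A
first = env.book_flight(found[0])      # tool B consumes tool A's output
second = env.book_flight(found[0])     # same call, but the state has changed
after = env.search_flights("Tokyo")    # the earlier booking alters this result
```

The second booking fails even though the call is identical to the first. A benchmark that scores each call in isolation cannot express this; one that tracks environment state can.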

Tool Call Complexity

Tool use benchmarks have evolved across three evaluation scenarios: simple (single-call), chained (sequential), and nested (dependent outputs).

Benchmark	Year	Focus	Key innovation
BFCL	2024	Function calling leaderboard	Live, versioned (v1-v3), multi-turn
ToolBench	2023	16K real APIs	Scale and diversity
ToolSandbox	2024	Stateful execution	State dependencies between calls
ComplexFuncBench	2025	Complex real-world calls	Implicit params, constraints
NESTFUL	2024	Nested API sequences	Output-as-input chaining
StableToolBench	2024	API stability	Virtual server with caching

What makes NESTFUL and ToolSandbox fundamentally harder than early benchmarks like APIBench?

They test nested and stateful tool calls where outputs from one call serve as inputs to subsequent calls and each call changes the environment state, unlike early benchmarks that tested single, independent function calls with explicitly provided parameters
They have more APIs to choose from
They use harder programming languages

Chapter 4: Web & Software Engineering Agents

This is where agent evaluation gets real. Web agents navigate actual websites. SWE agents fix actual bugs in actual codebases. The stakes are high: these are the benchmarks that most closely predict whether an agent can replace human labor.

Web Agents

Web agent benchmarks evolved from toy to terrifyingly real:

MiniWoB / MiniWoB++ (2017-2018)

Basic simulated web elements. Click buttons, fill forms. Early proof of concept. Simple, static, reproducible.

↓

WebShop / Mind2Web (2022-2023)

Simulated online shopping, broader web interactions. Static datasets enable reproducible evaluation. More realistic but still offline.

↓

WebArena / VisualWebArena (2023-2024)

Dynamic, online environments with realistic UI elements. Agents must interpret visual information, adapt to changing pages, and complete multi-step workflows. The real deal.

↓

WorkArena++ / ST-WebAgentBench (2024-2025)

Enterprise environments: multi-step office tasks, policy compliance, safety protocols. Not just "did it work?" but "did it follow the rules?"

Concept: Web agent evaluation has shifted from "can it click the right button?" to "can it complete a complex enterprise workflow while respecting policies and safety constraints?" Realization: Even state-of-the-art agents achieve only ~15-35% success on WebArena. The gap between demos and reliable deployment is enormous. And almost no benchmark tests policy compliance or safety — the very things that matter most for real deployment.

Software Engineering Agents

SWE-bench changed everything. Instead of synthetic coding puzzles (HumanEval, MBPP), it uses real GitHub issues from real repositories. The agent must read the issue, understand the codebase, find the bug, write a fix, and pass the test suite.
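"Pass the test suite" has two parts in SWE-bench's public task format: the tests that the reference fix turns from failing to passing (FAIL_TO_PASS) must now pass, and the tests that already passed (PASS_TO_PASS) must keep passing. The sketch below is a simplified illustration; `is_resolved` and the test names are hypothetical and this is not the official harness.

```python
def is_resolved(
    test_results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """True if the patch fixes the issue without breaking existing behavior."""
    # A test missing from the results is treated as a failure.
    fixed = all(test_results.get(test, False) for test in fail_to_pass)
    unbroken = all(test_results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken


fail_to_pass = ["test_bug_fixed"]
pass_to_pass = ["test_existing_feature"]

clean_fix = is_resolved(
    {"test_bug_fixed": True, "test_existing_feature": True},
    fail_to_pass,
    pass_to_pass,
)
regression = is_resolved(
    {"test_bug_fixed": True, "test_existing_feature": False},
    fail_to_pass,
    pass_to_pass,
)
```

The second case is why the two lists are kept apart: a patch that fixes the bug but breaks something else should not count as a success.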

Benchmark	Year	What it tests	Difficulty
HumanEval	2021	Self-contained coding problems	Entry-level
SWE-bench	2023	Real GitHub issues, full repos	Professional
SWE-bench Verified	2024	Curated subset with strong tests	Professional
SWE-bench+	2024	Fixed leakage and weak tests	Professional
SWE-bench Multimodal	2024	Visual software (JS, UI elements)	Professional+
SWELancer	2025	Freelance coding tasks, $ valued	Real-world
ITBench	2025	IT automation tasks	Enterprise

The SWE-bench family keeps getting refined because the original had problems: some solutions leaked through issue descriptions, test cases were too weak (accepting wrong answers), and tasks varied wildly in difficulty. Each variant (Lite, Verified, +, Multimodal) addresses a specific flaw.

Web & SWE Benchmark Evolution

The progression from simple simulations to real-world environments. Each row shows a benchmark's realism level (x-axis) and difficulty (circle size).


The SWELancer insight: By connecting agent performance to monetary value (how much the freelance task pays), SWELancer lets you directly measure the economic viability of agent deployment. If your agent can solve a $500 task but costs $200 in API calls, that is useful. If it costs $600, it is not. This is the first benchmark to make this explicit.
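The arithmetic behind that comparison is short enough to write down. The `net_value` helper is a hypothetical illustration, and the `success_rate` parameter is an added assumption (the text above treats the task as solved); it shows how the picture changes when the agent only succeeds some of the time.

```python
def net_value(payout: float, agent_cost: float, success_rate: float = 1.0) -> float:
    """Expected payout minus the cost of one attempt."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be in [0, 1]")
    return payout * success_rate - agent_cost


cheap_run = net_value(payout=500, agent_cost=200)      # worth deploying
expensive_run = net_value(payout=500, agent_cost=600)  # loses money
unreliable_run = net_value(payout=500, agent_cost=200, success_rate=0.3)

print(cheap_run, expensive_run, unreliable_run)
```

With a 30% success rate, the $200 agent also loses money on a $500 task, since the cost is paid on every attempt but the payout arrives only on success.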

Why does the SWE-bench family keep producing new variants (Lite, Verified, +, Multimodal)?

Because the original had evaluation flaws (solution leakage through issue descriptions, weak test cases that accepted wrong answers, and no visual tasks); each variant addresses a specific flaw to make the benchmark more reliable and comprehensive
Because agents have solved the original completely
Because each variant uses different programming languages

Chapter 5: Scientific & Conversational Agents

Scientific Agents

Scientific agent evaluation has grown from "can you answer a science question?" to "can you do actual science?" The survey traces this evolution across four stages of the scientific process:

Stage	What the agent does	Benchmark
Ideation	Generate novel, expert-level research ideas	Si et al. (2025)
Experiment design	Plan experiments with proper methodology	AAAR-1.0
Code execution	Write accurate, executable scientific code	SciCode, ScienceAgentBench, CORE-Bench
Peer review	Generate substantive, accurate feedback	Chamoun et al. (2024)

Concept: Testing whether an agent can answer science questions (ScienceQA) is categorically different from testing whether it can do science (DiscoveryWorld). Realization: DiscoveryWorld simulates complete scientific discovery cycles — hypothesis formation, experimentation, result interpretation — across 120 diverse tasks. This is the difference between testing if a student can pass an exam vs. testing if they can run a lab.

Two integrated frameworks stand out:

MLGym (Nathani et al., 2025) — a gym-like environment for AI research tasks. Thirteen challenges simulate real research workflows from hypothesis to analysis. This is the closest thing to "can an agent be a research scientist?"
LAB-Bench (Laurent et al., 2024) — tailored to biological research. Agents must design experiments, interpret tables and images, and synthesize findings. Domain-specific evaluation at its most demanding.

Conversational Agents

Customer service agents are a deceptively hard evaluation target. The agent must hold a multi-turn conversation, follow company policies, call the right functions in the right order, and communicate accurately. A single wrong action — refunding the wrong order, violating a return policy — has real consequences.

The benchmarks here span two approaches:

Trajectory prediction: Given a prefix of a human-agent conversation (with ground truth), predict the next action. ABCD (10K conversations, 55 intents), MultiWOZ, SMCalFlow use this approach.
Full simulation: The agent interacts with a simulated user and environment. It is assessed on bringing the environment to the correct state and communicating the right answer. This is harder and more realistic.

tau-Bench (Yao et al., 2024) is the standout: it emulates conversations between an agent and an LLM-simulated user in airline and retail domains. The benchmark includes databases, APIs, and domain policies. Each task has ground truth for both the expected database write and the correct user response. With 115 retail and 50 airline tasks, it tests whether agents can do the job without violating any rules.
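The two-part ground truth can be sketched as a single check. This is a hypothetical illustration, not tau-Bench's actual scoring code: the `task_passed` helper, the order records, and the substring match on the reply are all assumptions.

```python
def task_passed(
    final_db: dict,
    expected_db: dict,
    agent_messages: list[str],
    required_outputs: list[str],
) -> bool:
    """Pass only if the database ends up right AND the user was told the facts."""
    db_ok = final_db == expected_db
    transcript = " ".join(agent_messages).lower()
    told_user = all(item.lower() in transcript for item in required_outputs)
    return db_ok and told_user


expected_db = {"order_17": {"status": "refunded"}, "order_18": {"status": "shipped"}}

correct_run = task_passed(
    final_db={"order_17": {"status": "refunded"}, "order_18": {"status": "shipped"}},
    expected_db=expected_db,
    agent_messages=["Your refund of $25 has been issued."],
    required_outputs=["$25"],
)
# The agent sounds right but refunded the wrong order.
wrong_order_run = task_passed(
    final_db={"order_17": {"status": "delivered"}, "order_18": {"status": "refunded"}},
    expected_db=expected_db,
    agent_messages=["Your refund of $25 has been issued."],
    required_outputs=["$25"],
)
```

Checking the final database state rather than the wording of the conversation is what catches the second run: the reply is identical, but the environment is wrong.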

IntellAgent (Levi & Kadar, 2025) goes further: given a database schema and company policies document, it automatically generates evaluation scenarios. It constructs a policy graph, samples policy combinations, creates events, simulates dialogues, and has a critique agent analyze performance. This is evaluation-as-code: no manual test creation needed.

What makes DiscoveryWorld fundamentally different from earlier scientific agent benchmarks like ScienceQA?

It uses harder science questions
DiscoveryWorld simulates complete scientific discovery cycles (hypothesis formation, experimentation, and result interpretation) across 120 tasks, testing whether an agent can do science, not just answer questions about it
It covers more scientific domains

Chapter 6: Generalist Benchmarks

Application-specific benchmarks tell you if your agent can do one job well. But the promise of LLM agents is generality — one system that can handle web search today, database queries tomorrow, and code review next week. Generalist benchmarks test this breadth.

Three benchmark families dominate:

GAIA (Mialon et al., 2023)

466 human-crafted, real-world questions that require reasoning, multi-modal understanding, web navigation, and tool use — often all at once. A single GAIA question might require searching the web, extracting data from a PDF, computing something with Python, and synthesizing the answer. These are the kinds of tasks a capable human assistant handles routinely but that expose every weakness in an agent.

AgentBench (Liu et al., 2023)

A suite of eight interactive environments: operating system commands, SQL databases, digital games, knowledge graphs, household tasks, web browsing, and more. The key insight is testing the same agent across wildly different domains. An agent that aces web browsing but fails at OS commands reveals brittle, domain-dependent strategies rather than general intelligence.
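A per-environment breakdown is what makes that brittleness visible. The sketch below uses made-up scores and a hypothetical `domain_report` helper; it is not AgentBench's scoring method.

```python
from statistics import mean


def domain_report(scores: dict[str, float]) -> dict:
    """Summarize per-environment scores instead of reporting only the average."""
    weakest = min(scores, key=scores.get)
    strongest = max(scores, key=scores.get)
    return {
        "mean": mean(scores.values()),
        "strongest": strongest,
        "weakest": weakest,
        "spread": scores[strongest] - scores[weakest],
    }


# Hypothetical success rates for one agent across three environments.
report = domain_report({"web_browsing": 0.60, "os_commands": 0.10, "databases": 0.35})
print(report)
```

The mean alone (about 0.35) would suggest a mediocre but uniform agent. The spread of about 0.5 between the strongest and weakest environments tells a different story.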

Real-World Professional Benchmarks

The newest generation pushes even further:

OSWorld (Xie et al., 2024) — full computer operating environments. Write code, handle control flows, coordinate across applications.
TheAgentCompany (Xu et al., 2024) — simulates a software company. Agents browse internal websites, write code, run programs, and communicate with simulated coworkers. The best agent scores around 24%.
CRMArena (Huang et al., 2025) — customer relationship management at enterprise scale. Multi-step operations using both UI and API, with domain-specific policies.
HAL (Holistic Agent Leaderboard) (Stroebl et al., 2025) — aggregates multiple benchmarks into a unified platform covering coding, interactive applications, and safety.

Concept: Generalist benchmarks expose a brutal truth: agents that appear competent in demos often score below 25% on these comprehensive tests. TheAgentCompany's best agent scores ~24%. GAIA's hardest level (Level 3) sees agents below 10%. Realization: The gap between "impressive demo" and "reliable deployment" is not incremental — it is an order of magnitude. Generalist benchmarks are the reality check.

Generalist Benchmark Comparison

Compare the scope and best-known agent performance across major generalist benchmarks. Lower bars = harder benchmarks where current agents still struggle.

Why are generalist benchmarks important even if you only plan to deploy your agent for one specific application?

They are not; application-specific benchmarks are sufficient
Because generalist benchmarks reveal whether an agent has brittle, domain-dependent strategies or genuine general capabilities; an agent that only works on its training domain will fail on edge cases and novel situations within that domain too
Because generalist benchmarks are easier to set up

Chapter 7: Evaluation Frameworks

Benchmarks tell you what to measure. Frameworks tell you how. They are the infrastructure that makes evaluation repeatable, scalable, and actionable. The survey identifies three levels of granularity at which frameworks operate:

Level 1: Final Response Evaluation

Did the agent produce the right answer? This is the simplest level. Most frameworks use LLM-based judges to evaluate agent responses against predefined criteria. Some (Databricks Mosaic, PatronusAI) offer proprietary judge models trained specifically for this purpose.

Level 2: Stepwise Evaluation

Was each individual action correct? This level assesses tool selection, parameter extraction, and execution output at every step. Galileo introduces an action advancement metric: did this step move the agent closer to the goal? This is subtler than binary pass/fail — a step might be "correct" (valid tool call, correct parameters) but pointless (didn't advance the task).

Level 3: Trajectory-Based Assessment

Was the overall path good? Google Vertex AI and LangSmith analyze the full sequence of steps against an expected optimal path. AgentEvals supports LLM-as-judge evaluation of trajectories with or without reference trajectories. For LangGraph-based agents, it can assess whether the agent followed the expected workflow graph.

Concept: The three levels form a hierarchy: final response is the cheapest but least informative, trajectory is the richest but hardest to judge. Realization: Most teams default to Level 1 (final response) because it is easy. But the actionable insights — "the agent chose the wrong tool at step 3" or "the agent took 47 steps when 5 would suffice" — only come from Levels 2 and 3. The overhead of fine-grained evaluation is the price of understanding your agent.
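The three levels can be applied to the same recorded run. The sketch below is a hand-rolled illustration under stated assumptions: the run format, the `evaluate` helper, and the in-order matching rule are hypothetical, and none of this is the API of LangSmith, Vertex AI, Galileo, or AgentEvals.

```python
def contains_in_order(actual: list[str], reference: list[str]) -> bool:
    """True if every reference action appears in `actual`, in the same order."""
    remaining = iter(actual)
    # `step in remaining` advances the iterator, so order is enforced.
    return all(step in remaining for step in reference)


def evaluate(run: dict, expected_answer: str, reference: list[str]) -> dict:
    actions = [step["action"] for step in run["steps"]]
    ok_steps = sum(1 for step in run["steps"] if step["ok"])
    return {
        # Level 1: final response
        "final_correct": run["answer"] == expected_answer,
        # Level 2: stepwise
        "step_success_rate": ok_steps / len(actions) if actions else 0.0,
        # Level 3: trajectory against a reference path
        "follows_reference": contains_in_order(actions, reference),
        "extra_steps": len(actions) - len(reference),
    }


run = {
    "steps": [
        {"action": "search", "ok": True},
        {"action": "open_page", "ok": False},
        {"action": "open_page", "ok": True},
        {"action": "extract", "ok": True},
    ],
    "answer": "42",
}
result = evaluate(run, expected_answer="42", reference=["search", "open_page", "extract"])
print(result)
```

Level 1 reports a plain success here. Levels 2 and 3 add that one step failed and that the run took one more step than the reference path.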

Framework	Stepwise	Trajectory	Monitoring	A/B Testing	Synth. Data
LangSmith	Yes	Yes	Yes	Yes	No
Langfuse	Yes	No	Yes	Yes	No
Google Vertex AI	Yes	Yes	Yes	Yes	No
Galileo	Yes	No	Yes	Yes	Yes
PatronusAI	Yes	No	Yes	Yes	Yes
Mosaic AI	Yes	No	Yes	Yes	Yes
AgentEvals	No	Yes	No	No	No

Beyond monitoring and assessment, the survey highlights Gym-like environments — inspired by OpenAI Gym — that provide standardized, interactive settings for training and evaluating agents:

BrowserGym (Chezelles et al., 2024) — for web agents
MLGym (Nathani et al., 2025) — for AI research agents
SWE-Gym (Pan et al., 2024) — for software engineering agents

These provide the controlled, repeatable environments that passive monitoring cannot: you can replay trajectories, inject failures, and test edge cases systematically.
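A toy environment illustrates the replay and failure-injection ideas. The `FlakyToolEnv` class is hypothetical and only loosely follows the reset/step pattern; it does not reproduce the actual Gym, BrowserGym, MLGym, or SWE-Gym interfaces, and its three-step task is made up.

```python
import random


class FlakyToolEnv:
    """Toy reset/step environment with seeded, injectable tool failures."""

    def __init__(self, failure_rate: float = 0.0, seed: int = 0) -> None:
        self.failure_rate = failure_rate
        self.seed = seed
        self.reset()

    def reset(self) -> str:
        # Re-seeding on reset makes every episode replay the same failures.
        self.rng = random.Random(self.seed)
        self.progress = 0
        return "start"

    def step(self, action: str) -> tuple[str, float, bool]:
        """Return (observation, reward, done)."""
        if self.rng.random() < self.failure_rate:
            return "tool_error", 0.0, False   # injected failure
        if action == "advance":
            self.progress += 1
        done = self.progress >= 3
        return f"progress={self.progress}", float(done), done


def run_episode(env: FlakyToolEnv, max_steps: int = 20) -> list[str]:
    """Drive the environment with a trivial policy and record observations."""
    trace = [env.reset()]
    for _ in range(max_steps):
        observation, _reward, done = env.step("advance")
        trace.append(observation)
        if done:
            break
    return trace


clean_trace = run_episode(FlakyToolEnv(failure_rate=0.0))
flaky_env = FlakyToolEnv(failure_rate=0.5, seed=1)
replay_a = run_episode(flaky_env)
replay_b = run_episode(flaky_env)
```

Because the failures come from a seeded generator, `replay_a` and `replay_b` are identical, which is what lets you compare two agent versions against exactly the same sequence of tool errors.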

What advantage does Galileo's "action advancement metric" have over standard stepwise evaluation?

It measures whether each step advances toward the goal, not just whether the tool call was technically correct, catching cases where an agent makes valid but pointless actions that do not progress the task
It is faster to compute
It uses better LLM judges

Chapter 8: Critical Gaps

The survey doesn't just catalogue what exists. It identifies what is missing. Four critical gaps threaten to make current evaluation frameworks inadequate as agents move toward real-world deployment.

Gap 1: Granular Evaluation

Most benchmarks report a single number: success rate. Did the agent complete the task? Yes or no. This tells you nothing about where it failed. Was it a bad plan? A wrong tool call? A misunderstood user intent? Without step-level and trajectory-level metrics, you cannot diagnose or fix agent failures.

The fix: Standardized, fine-grained evaluation metrics that capture each decision point. WebCanvas's navigational node completion rates are a step in this direction. LangSmith and Galileo's trajectory analysis is another. But there is no standard yet.
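As a rough illustration of what a finer-grained report could look like, the sketch below tallies the first failing stage of each run. The stage labels, the run format, and both helpers are hypothetical; this is not a metric defined by WebCanvas, LangSmith, or Galileo.

```python
from collections import Counter
from typing import Optional


def first_failure(steps: list[dict]) -> Optional[str]:
    """Return the stage of the first failed step, or None if all succeeded."""
    for step in steps:
        if not step["ok"]:
            return step["stage"]
    return None


def failure_breakdown(runs: list[list[dict]]) -> Counter:
    """Count, across runs, which stage failed first."""
    stages = (first_failure(run) for run in runs)
    return Counter(stage for stage in stages if stage is not None)


runs = [
    [{"stage": "plan", "ok": True}, {"stage": "tool_call", "ok": False}],
    [{"stage": "plan", "ok": False}],
    [{"stage": "plan", "ok": True}, {"stage": "tool_call", "ok": True}],
    [{"stage": "plan", "ok": True}, {"stage": "tool_call", "ok": False}],
]
breakdown = failure_breakdown(runs)
success_rate = sum(first_failure(run) is None for run in runs) / len(runs)
print(success_rate, breakdown)
```

The success rate alone says 25%. The breakdown says that two of the three failures started at the tool call, which is the part a developer can act on.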

Gap 2: Cost and Efficiency

Current evaluations almost exclusively measure accuracy. But a web agent that succeeds 80% of the time at $0.50 per task is often more valuable than one that succeeds 85% of the time at $5.00: the first pays about $0.63 per successful task, the second about $5.88. As Kapoor et al. (2024) observed, ignoring cost "inadvertently drives the development of highly capable but resource-intensive agents, limiting their practical deployment."

The fix: Integrate cost as a core metric alongside accuracy. Track token usage, API expenses, inference time, and total resource consumption. SWELancer is the first benchmark to link performance to monetary value — more benchmarks should follow.
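A cost-aware comparison of the two web agents from this gap can be sketched in a few lines. Both helpers are hypothetical, and the break-even formula assumes that each success is worth a fixed amount and that cost is paid on every attempt.

```python
def cost_per_success(cost_per_task: float, success_rate: float) -> float:
    """Average spend needed to obtain one successful task."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_task / success_rate


def break_even_value(
    cheap_cost: float, cheap_rate: float, strong_cost: float, strong_rate: float
) -> float:
    """Value of one success above which the more accurate agent wins.

    Solves value * cheap_rate - cheap_cost == value * strong_rate - strong_cost.
    """
    if strong_rate <= cheap_rate:
        raise ValueError("strong_rate must exceed cheap_rate")
    return (strong_cost - cheap_cost) / (strong_rate - cheap_rate)


cheap = cost_per_success(0.50, 0.80)
strong = cost_per_success(5.00, 0.85)
threshold = break_even_value(0.50, 0.80, 5.00, 0.85)

print(f"cheap: ${cheap:.2f}, strong: ${strong:.2f}, break-even: ${threshold:.2f}")
```

Under these assumptions the more accurate agent only pays off when a single success is worth more than about $90. Below that, the extra five points of accuracy do not cover the extra $4.50 per task.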

Gap 3: Safety and Compliance

Almost no benchmark comprehensively tests safety. Can the agent be jailbroken into harmful actions? Does it follow organizational policies? Does it handle adversarial inputs gracefully? ST-WebAgentBench and AgentHarm have started, but the coverage is minimal compared to the risk.

The fix: Multi-dimensional safety benchmarks that simulate adversarial scenarios, test policy compliance, and evaluate behavior in multi-agent settings where emergent risks arise.

Gap 4: Scalable, Continuously Updated Benchmarks

Static benchmarks get saturated. Models improve, the benchmark becomes easy, and the leaderboard stops being informative. Human-annotated benchmarks are expensive to refresh. The field needs automated benchmark generation and live benchmarks that evolve.

The fix: IntellAgent's automatic benchmark generation, BFCL's live versioning, and Agent-as-a-Judge (using LLM agents to evaluate other agents) all point toward scalable evaluation. But these approaches need validation — automated judges can have their own biases.
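One basic way to validate an automated judge is to compare its verdicts with human labels on the same runs. The sketch below computes raw agreement and Cohen's kappa for binary pass/fail labels. The labels are made up, and returning 1.0 when both raters give one identical label to every item is a convention chosen here, since kappa is undefined in that case.

```python
def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    # Agreement expected if both raters labeled at random with these rates.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # convention for the degenerate all-same-label case
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail verdicts on eight agent runs.
judge_labels = [True, True, True, True, False, False, True, True]
human_labels = [True, True, False, False, False, False, True, True]

agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)
kappa = cohens_kappa(judge_labels, human_labels)
print(agreement, kappa)
```

Here the judge agrees with the humans on 75% of runs, but kappa is only 0.5, and both disagreements are cases where the judge passed a run the humans failed. That kind of one-sided error is the bias worth looking for before trusting an automated judge.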

Concept: The four gaps — granularity, cost, safety, and freshness — are not independent. They compound. Realization: An agent that appears safe in a static benchmark might be unsafe in a dynamic one. An agent that appears efficient on a coarse metric might be wasteful at the step level. And a benchmark that was valid last month might be saturated this month. Solving each gap in isolation is not enough — the field needs evaluation systems that address all four simultaneously.

Evaluation Gap Radar

How well do current benchmarks cover each critical dimension? The shaded area shows the current state of coverage.

A company deploys an agent that scores 90% on SWE-bench Verified. In production, it costs $12 per task (3x human cost), occasionally violates data privacy policies, and its performance drops to 70% on new repositories not in the benchmark. Which gaps does this scenario illustrate?

All four gaps: cost-efficiency ($12 per task, 3x human cost), safety (privacy violations), granularity (the single 90% score hides the drop to 70% on new repos), and freshness (the benchmark did not include the new repositories). Each one widened the distance between benchmark performance and real-world viability
Only the cost gap
Only the safety gap

Chapter 9: Connections

This survey sits at the center of a rapidly expanding web of agent research. Here is how it connects to other key work:

Topic	Connection
Meta-Harness	If you want to build your own evaluation harness, start here. Meta's framework for evaluating coding agents directly implements the trajectory-based assessment discussed in this survey.
Dive into Claude Code	A concrete example of an SWE agent in the wild. Claude Code's architecture — tool use, planning, self-reflection, memory — maps directly onto the four fundamental capabilities this survey evaluates.
RLHF / DPO / RLEF	How do you train agents to get better? RLHF and DPO optimize for human preferences; RLEF (RL from Execution Feedback) optimizes for task completion. The evaluation methods surveyed here become the reward signals for these training algorithms.
SayCan / RT-2 / pi0	Embodied agents face the same evaluation challenges but in the physical world. The survey's framework extends naturally: capabilities (manipulation, navigation), applications (household, industrial), and generalist (diverse tasks in simulation).
Kimi-K2 / DeepSeek-V3	The underlying LLMs that power agents. Better base models (larger context, better reasoning) directly improve agent performance on these benchmarks — but the evaluation still requires agent-specific methodology.

Cheat Sheet

Dimension	Key Benchmarks	What they test	Current best %
Planning	PlanBench, MINT, Natural Plan	Multi-step decomposition, state tracking	<30% (high complexity)
Tool Use	BFCL, ToolSandbox, NESTFUL	Function calling, chaining, statefulness	~70-80% (simple), ~40% (complex)
Web Agents	WebArena, WorkArena++	Dynamic web navigation, enterprise tasks	~15-35%
SWE Agents	SWE-bench Verified, SWELancer	Real GitHub bugs, freelance tasks	~50-72% (Verified)
Scientific	DiscoveryWorld, MLGym, CORE-Bench	Full research cycles, reproducibility	~20-50%
Conversational	tau-Bench, IntellAgent	Policy compliance, multi-turn dialogue	~40-65%
Generalist	GAIA, TheAgentCompany, HAL	Diverse, multi-skill tasks	~2-55% (varies by level)

The meta-insight of this survey: Agent evaluation is still in its early days. The field knows how to measure whether an agent gets the right answer, but barely knows how to measure whether it got there safely, efficiently, or for the right reasons. As agents become more autonomous and take more consequential actions, closing these evaluation gaps becomes not just a research problem but a safety imperative.

If you were designing an evaluation framework for a new LLM-based coding agent, which combination of elements from this survey would give you the most comprehensive assessment?

SWE-bench Verified alone
BFCL + HumanEval
Capability-level tests (planning via PlanBench, tool use via BFCL), application-specific tests (SWE-bench Verified for bug fixing, TDD-Bench for test generation), generalist breadth (GAIA or AgentBench), plus a framework like LangSmith for trajectory-level cost and step analysis, covering all four survey dimensions

Survey on Evaluation of LLM-based Agents