Agrawal, Tan, Soylu, Ziems, Khare, Opsahl-Ong, Singhvi, et al. — UC Berkeley, Stanford, Notre Dame, Databricks, MIT, 2026

GEPA: Reflective Prompt Evolution

RL needs 24,000 rollouts to adapt an LLM. What if you could read the execution trace in natural language, diagnose what went wrong, and fix the prompt directly? GEPA does exactly that — and outperforms GRPO with 35x fewer samples.

Prerequisites: LLM prompting basics + Compound AI systems + Intuition for RL/GRPO
10 chapters · 4 simulations

Chapter 0: The Problem

You have a multi-hop question-answering system. It uses an LLM to generate a search query, retrieves documents, then uses the LLM again to synthesize an answer. The system works, but it only gets 42% accuracy on HotpotQA. You want to improve it.

The standard approach in 2025–2026 is reinforcement learning. Specifically, GRPO (Group Relative Policy Optimization) — you run the system thousands of times, observe which runs succeed and which fail, and use those scalar rewards to update the model weights via policy gradients.

There is a brutal cost. In the paper's experiments, GRPO is run for 24,000 rollouts per task. Each rollout means running your full AI system end-to-end: LLM calls, tool invocations, retrieval, the works. If each rollout costs $0.05 in API calls, that is $1,200 per task. And if you are using a proprietary model like GPT-4.1, you typically cannot update the weights at all.

The core tension: RL collapses rich execution traces — all the reasoning, the tool calls, the error messages, the intermediate outputs — into a single scalar: 0 (wrong) or 1 (right). That is like reading a student's 10-page exam, throwing away everything except the final grade, and trying to teach them from that alone. All the diagnostic information about what went wrong and why is lost.

What if there were a way to learn from those traces directly? To read what the system did, understand why it failed, and fix the prompt accordingly — all in natural language, with no weight updates needed?

That is the question GEPA answers.

Rollout Efficiency: GRPO vs GEPA

[Interactive figure: a rollout-budget slider comparing GRPO and GEPA. GEPA reaches strong performance with a fraction of GRPO's budget; a dashed line marks GRPO's final score after 24K rollouts.]

Why is GRPO sample-inefficient for adapting compound AI systems?

Chapter 1: The Key Insight

Here is the observation that makes GEPA possible: LLM execution traces are natural language.

When a compound AI system runs, it produces a stream of text: the system prompt, the LLM's chain-of-thought reasoning, the tool calls it makes, the tool outputs it receives, the intermediate results it passes between modules, and — crucially — the evaluation feedback (compiler errors, failed rubrics, mismatched answers).

All of this is text. And modern LLMs are extraordinarily good at reading text, diagnosing problems in it, and proposing fixes.

RL Approach (GRPO): Run system → get output → compare to gold answer → reward = 0 or 1 → policy gradient update to model weights. The reasoning trace is discarded.
↓ vs ↓
GEPA Approach: Run system → capture full execution trace (reasoning + tool calls + errors) → LLM reads trace → diagnoses what went wrong → proposes specific prompt changes. The trace IS the learning signal.

Why this is more informative: A scalar reward tells you "the system failed." A trace tells you "the system failed because the second-hop query was too vague; it repeated the original question instead of targeting the missing entity mentioned in the first-hop summary." The second signal lets you make a targeted fix: "add an instruction to the query-generation prompt that says: identify entities in summary_1 that aren't covered by the first hop, and search for those."

This is what the authors call reflective mutation — using an LLM to reflect on execution traces and mutate prompts based on that reflection. It is the core mechanism of GEPA, and it is why GEPA can learn from as few as 6 training rollouts on some tasks.

The formal claim: Natural language traces, paired with structured feedback from the evaluation function, provide a much richer learning signal than scalar rewards. This lets GEPA be up to 35x more sample-efficient than GRPO. The key: the LLM performing the reflection has strong language priors — it already understands what good instructions look like, what common failure modes are, and how to fix them.
What makes execution traces a richer learning signal than scalar rewards?

Chapter 2: Compound AI Systems

Before we can optimize prompts, we need a precise definition of what we are optimizing. GEPA targets compound AI systems — not single LLM calls, but multi-module pipelines with control flow, tools, and multiple prompts.

Formal Definition

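A compound AI system is a tuple Φ = (M, C, X, Y) where M = ⟨M1, ..., M|M|⟩ is the set of language modules (each with its own prompt πi and weights θi), C is the control flow that orchestrates them, and X and Y are the system's input and output spaces.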

Concrete example — multi-hop QA system: Module M1 generates a first-hop search query from the question. The control flow C calls a retrieval tool, gets documents, summarizes them. Module M2 generates a second-hop query using the question + first-hop summary. Control flow retrieves again. Module M3 synthesizes the final answer from all retrieved information. Each module has its own prompt πi — GEPA can optimize all of them.
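The multi-hop example above can be sketched as a small data structure plus a control-flow function. This is a minimal illustration, not GEPA's or DSPy's API: `Module`, `run_multihop`, `fake_llm`, and `fake_retrieve` are hypothetical names, and the LLM and retriever are stubs so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-ins: any callable (prompt, inputs) -> text can act as the LLM,
# and any callable query -> text can act as the retrieval tool.
LLM = Callable[[str, Dict[str, str]], str]
Retriever = Callable[[str], str]

@dataclass
class Module:
    name: str
    prompt: str  # the per-module prompt that a prompt optimizer may rewrite

def run_multihop(modules: Dict[str, Module], llm: LLM,
                 retrieve: Retriever, question: str) -> Dict[str, str]:
    """Control flow C: query, retrieve, query again, retrieve again, answer."""
    trace: Dict[str, str] = {"question": question}
    trace["query_1"] = llm(modules["hop1"].prompt, {"question": question})
    trace["summary_1"] = retrieve(trace["query_1"])
    trace["query_2"] = llm(modules["hop2"].prompt,
                           {"question": question, "summary_1": trace["summary_1"]})
    trace["summary_2"] = retrieve(trace["query_2"])
    trace["answer"] = llm(modules["answer"].prompt,
                          {"question": question,
                           "summary_1": trace["summary_1"],
                           "summary_2": trace["summary_2"]})
    return trace  # every intermediate value is kept, not just the final answer

# Stubs so the sketch is runnable without any model or search backend.
def fake_llm(prompt: str, inputs: Dict[str, str]) -> str:
    return f"<{prompt} | {', '.join(inputs)}>"

def fake_retrieve(query: str) -> str:
    return f"docs for {query}"

modules = {
    "hop1": Module("hop1", "Given the fields question, produce the fields query."),
    "hop2": Module("hop2", "Given the fields question, summary_1, produce the fields query."),
    "answer": Module("answer", "Given the fields question, summary_1, summary_2, produce the fields answer."),
}
trace = run_multihop(modules, fake_llm, fake_retrieve, "Who founded the town?")
```

Returning the whole `trace` dictionary rather than only `trace["answer"]` is the design point: the intermediate queries and summaries are what a reflection step would later read.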

The Optimization Target

Let ΠΦ = ⟨π1, ..., π|M|⟩ be the collection of all prompts and ΘΦ = ⟨θ1, ..., θ|M|⟩ the model weights. For a task instance (x, m) with input x and evaluator metadata m (gold answers, rubrics), the system produces output y = Φ(x; ⟨Π, Θ⟩Φ). A metric μ : Y × M → [0,1] scores the output quality.

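$$\langle \Pi^*, \Theta^* \rangle_\Phi = \arg\max_{\langle \Pi, \Theta \rangle_\Phi} \; \mathbb{E}_{(x,m) \sim \mathcal{T}} \left[ \mu\left( \Phi(x; \langle \Pi, \Theta \rangle_\Phi),\, m \right) \right]$$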

GEPA focuses on optimizing only the prompts ΠΦ, keeping the weights ΘΦ frozen. This is what makes it work with any LLM — including proprietary APIs where weight updates are impossible.

The Budget Constraint

Rollouts are expensive. The optimizer is limited to at most B rollouts. The real challenge: extract maximal learning signal from every single rollout.

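$$\langle \Pi^*, \Theta^* \rangle_\Phi = \arg\max_{\langle \Pi, \Theta \rangle_\Phi} \; \mathbb{E}_{(x,m) \sim \mathcal{T}} \left[ \mu\left( \Phi(x; \langle \Pi, \Theta \rangle_\Phi),\, m \right) \right] \quad \text{s.t. } \#\text{rollouts} \le B$$
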
Why compound systems make this harder (and more interesting): With a single-module system, there is one prompt to optimize. With a compound system, there are |M| prompts, and they interact. Improving the query-generation prompt might make the answer-synthesis prompt work worse if it was calibrated to the old query style. GEPA handles this by selecting which module to update (round-robin policy) and evaluating the whole system end-to-end after each change.
In the formal definition, what does GEPA optimize and what does it keep fixed?

Chapter 3: Reflective Mutation

This is the heart of GEPA. Reflective mutation is the process by which GEPA reads an execution trace, diagnoses what went wrong, and proposes a specific prompt change. Let's walk through it step by step.

Step 1: Execute and Trace

GEPA runs the current candidate system on a minibatch of training examples. It captures the full execution trace: every module's input, the LLM's reasoning, tool calls, tool outputs, and final answer. It also calls the feedback function μf, which returns both a numeric score and text feedback (e.g., compiler errors, which rubrics failed, the gold answer for comparison).
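To make the difference between a scalar metric and a feedback function concrete, here is a minimal sketch of a feedback function that returns both a score and text. The names `Feedback` and `exact_match_feedback` are hypothetical, and exact-match scoring is an assumption chosen for brevity; real feedback functions would surface whatever the evaluator already produces (compiler errors, failed rubrics).

```python
from typing import NamedTuple

class Feedback(NamedTuple):
    score: float  # what a scalar-reward method would keep
    text: str     # what a scalar-reward method would throw away

def exact_match_feedback(predicted: str, gold: str) -> Feedback:
    """Hypothetical feedback function: exact-match score plus a textual explanation."""
    if predicted.strip().lower() == gold.strip().lower():
        return Feedback(1.0, "Correct.")
    return Feedback(0.0, f"Expected answer: '{gold}', Got: '{predicted}'")

fb = exact_match_feedback("Funchal municipality", "Madeira archipelago")
print(fb.score)
print(fb.text)
```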

Step 2: Select a Module

GEPA selects which module to update via a round-robin policy. In a 3-module system, it cycles through M1, M2, M3, M1, ... This ensures all modules get optimized, not just the one that happens to fail most visibly.
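Round-robin selection is simple enough to show in a few lines. This sketch uses `itertools.cycle`; the module names are the placeholders from the text, and how the real implementation tracks its position in the cycle is not specified here.

```python
from itertools import cycle, islice

module_names = ["M1", "M2", "M3"]

# cycle() repeats the sequence forever; islice() takes the first seven picks.
schedule = list(islice(cycle(module_names), 7))
print(schedule)  # ['M1', 'M2', 'M3', 'M1', 'M2', 'M3', 'M1']
```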

Step 3: Reflect

A reflection LLM is shown the current prompt, the execution trace for the selected module, the score, and the text feedback. Its task: diagnose what went wrong and propose specific changes to the prompt.

What the reflection LLM sees: (1) The current prompt for the module being optimized. (2) The full trajectory: what the module received as input, what it reasoned, what it outputted. (3) The final score and text feedback (e.g., "Expected answer: 'Madeira archipelago', Got: 'Funchal municipality' — the second-hop query was too narrow, searching for the parish instead of the broader region"). (4) The accumulated lessons from the candidate's ancestry — what was tried before and what worked.
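The four ingredients listed above can be assembled into a single message for the reflection LLM. The template below is a hypothetical illustration, not the meta-prompt used in the paper; `build_reflection_prompt` and its parameter names are assumptions.

```python
from typing import List

def build_reflection_prompt(current_prompt: str, module_input: str, module_output: str,
                            score: float, feedback: str, lessons: List[str]) -> str:
    """Combine prompt, trajectory, score/feedback, and prior lessons (hypothetical template)."""
    lesson_lines = "\n".join(f"- {lesson}" for lesson in lessons) or "- (none yet)"
    return (
        "Current instruction for this module:\n"
        f"{current_prompt}\n\n"
        "What the module received:\n"
        f"{module_input}\n\n"
        "What the module produced:\n"
        f"{module_output}\n\n"
        f"Score: {score}\n"
        f"Feedback: {feedback}\n\n"
        "Lessons from earlier candidates:\n"
        f"{lesson_lines}\n\n"
        "Diagnose what went wrong and write an improved instruction."
    )

message = build_reflection_prompt(
    current_prompt="Given the fields question, summary_1, produce the fields query.",
    module_input="question: ...; summary_1: ...",
    module_output="query: ...",
    score=0.0,
    feedback="Expected answer: 'Madeira archipelago', Got: 'Funchal municipality'",
    lessons=[],
)
print(message)
```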

Step 4: Mutate

Based on the reflection, the LLM proposes an updated prompt. This is not a random perturbation — it is a targeted edit informed by concrete failure analysis. The new prompt inherits all the lessons from its parent and adds the new insight.

Step 5: Evaluate

The updated system is re-run on the same minibatch. If the score improves, the new candidate is added to the pool. If not, it is discarded. No wasted budget on bad ideas.
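The acceptance rule in Step 5 amounts to a comparison of total scores on the same minibatch. A minimal sketch, assuming a strict improvement is required (ties are rejected); `accept_mutation` is a hypothetical name:

```python
from typing import Sequence

def accept_mutation(parent_scores: Sequence[float], child_scores: Sequence[float]) -> bool:
    """Keep the mutated candidate only if it scores strictly higher on the same minibatch."""
    if len(parent_scores) != len(child_scores):
        raise ValueError("parent and child must be scored on the same minibatch")
    return sum(child_scores) > sum(parent_scores)

print(accept_mutation([0.0, 1.0, 0.0], [1.0, 1.0, 0.0]))  # True: one more example solved
print(accept_mutation([1.0, 1.0, 0.0], [1.0, 0.0, 1.0]))  # False: same total, no gain
```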

Two sources of diagnostic signal: GEPA identifies two distinct types of traces. Execution traces are what the LLM produces (reasoning, tool calls). Evaluation traces are what the environment produces when computing the reward (compiler errors, failed test cases, mismatched rubrics). Both are natural language. Both are fed to the reflection LLM. Many evaluation metrics produce rich text before collapsing to a scalar — GEPA captures that text before the collapse.

Reflective Mutation in Action

[Interactive demo: run the system to view an execution trace, then view the reflection LLM's diagnosis, then view the resulting prompt change.]

Why reflection beats random mutation: Without reflection, you could still do evolutionary prompt optimization — randomly perturb prompts and keep the best. But random mutation has no idea what to change. It might flip irrelevant words while ignoring the actual problem. Reflective mutation is targeted: it reads the failure, understands the cause, and surgically fixes the prompt. This is why GEPA can match GRPO's performance with as few as 6 training rollouts on some tasks (102, 32, 6, and 179 training rollouts across four tasks).
What makes reflective mutation fundamentally different from random prompt perturbation?

Chapter 4: The Pareto Frontier

Reflective mutation is powerful, but it has a failure mode: local optima. If you always evolve the best-performing candidate, you get trapped. You find one good strategy, keep refining it, and exhaust your budget without discovering a fundamentally different (and better) approach.

Figure 6 in the paper shows this vividly. With a greedy "always pick the best" strategy (like TextGrad uses), the optimization tree degenerates into a long chain — one child after another, each a minor tweak of the same idea. After the first improvement, the search stalls and burns rollouts trying to squeeze out more from a single strategy.

Pareto-Based Candidate Selection

GEPA's solution is elegant: instead of tracking one "best" candidate, it maintains a Pareto frontier across all training instances.

Here is how it works. For each training example i, GEPA records the score of every candidate. A candidate is on the Pareto frontier if it achieves the best score on at least one training example. Candidates that are strictly dominated (beaten by another candidate on every example) are pruned.

Concrete example: Suppose you have 3 training examples and 4 candidates. Candidate A scores [0.8, 0.3, 0.5]. Candidate B scores [0.5, 0.9, 0.4]. Candidate C scores [0.4, 0.4, 0.95]. Candidate D scores [0.3, 0.3, 0.3]. Candidates A, B, and C are all on the Pareto frontier: each is the best at something. D is dominated (B and C each beat it on every example) and gets pruned. GEPA will sample from {A, B, C}, weighted by how many examples each leads on; here each leads on exactly one example, so the three are equally likely.
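The example can be checked mechanically. The sketch below applies the "best on at least one example" rule stated above to the four candidates; `lead_counts` is a hypothetical helper, and the explicit dominance-pruning step is omitted because in this example the only dominated candidate already leads on nothing.

```python
from typing import Dict, List

scores: Dict[str, List[float]] = {
    "A": [0.8, 0.3, 0.5],
    "B": [0.5, 0.9, 0.4],
    "C": [0.4, 0.4, 0.95],
    "D": [0.3, 0.3, 0.3],
}

def lead_counts(scores: Dict[str, List[float]]) -> Dict[str, int]:
    """Count, per candidate, the examples on which it achieves the top score (ties count)."""
    num_examples = len(next(iter(scores.values())))
    counts = {name: 0 for name in scores}
    for i in range(num_examples):
        best = max(s[i] for s in scores.values())
        for name, s in scores.items():
            if s[i] == best:
                counts[name] += 1
    return counts

counts = lead_counts(scores)
frontier = sorted(name for name, c in counts.items() if c > 0)
print(counts)    # {'A': 1, 'B': 1, 'C': 1, 'D': 0}
print(frontier)  # ['A', 'B', 'C']
```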

Why Pareto Prevents Local Optima

A candidate that solves a specific type of problem well might encode a strategy that the globally best candidate lacks. By keeping it in the pool and sometimes evolving from it, GEPA explores diverse regions of prompt space. The search history becomes a branching tree rather than a single chain.

Ablation evidence: Table 3 in the paper compares three selection strategies on Qwen3 8B across four tasks. SelectBestCandidate (greedy, used by TextGrad): +6.05% aggregate improvement. BeamSearch (top-N pool, used by APO): +5.11%. GEPA's Pareto-based selection: +12.44%. That is more than double the greedy strategy. On PUPA specifically, Pareto gives +11.03% where greedy gives only +4.63%.

Sampling from the Frontier

When GEPA needs to select a candidate to mutate, it:

  1. Computes the Pareto frontier (non-dominated candidates)
  2. Counts how many training examples each frontier candidate leads on (call this f[Φ])
  3. Samples a candidate with probability proportional to f[Φ]

Candidates that lead on many examples get sampled more often (exploitation), but candidates that lead on even just one example still get a chance (exploration). This naturally balances the exploration–exploitation tradeoff.
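The three steps above map directly onto weighted sampling. A minimal sketch using the standard library's `random.choices`; the lead counts are made-up illustrative values, and `sample_candidate` is a hypothetical name.

```python
import random
from typing import Dict

def sample_candidate(lead_counts: Dict[str, int], rng: random.Random) -> str:
    """Sample a frontier candidate with probability proportional to its lead count."""
    names = [name for name, count in lead_counts.items() if count > 0]
    weights = [lead_counts[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Illustrative counts: A leads on 3 examples, B on 1, C on none (so C is off the frontier).
rng = random.Random(0)
picks = [sample_candidate({"A": 3, "B": 1, "C": 0}, rng) for _ in range(10_000)]
print(picks.count("A") / len(picks))  # roughly 0.75
print(picks.count("B") / len(picks))  # roughly 0.25
```

B is sampled about a quarter of the time despite leading on a single example, which is the exploration half of the tradeoff.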

Why does GEPA use Pareto-based selection instead of always evolving the best candidate?

Chapter 5: The Full Algorithm

Now let's assemble the pieces. GEPA's optimization loop combines reflective mutation with Pareto-based candidate selection in a genetic framework.

Initialization

GEPA starts with a single candidate: the base system Φ with its seed prompts. The training data Dtrain is split into Dfeedback (used for rollouts and reflection) and Dpareto (used for scoring candidates on the Pareto frontier). The candidate pool P contains only the base system initially.

The Loop (while budget B > 0)

1. Select Candidate
Use Pareto-based selection (Algorithm 2) to choose a candidate Φk from the frontier. Candidates leading on more examples are sampled more often.
2. Select Module
Choose which module j to update (round-robin across the |M| modules).
3. Execute + Gather Feedback
Run Φk on a minibatch M from Dfeedback. Trace the execution. Call μf for scores and text feedback.
4. Reflective Mutation
Show the reflection LLM the (prompt, trace, score, feedback). It proposes an updated prompt πj'.
5. Minibatch Check
Re-run the updated system on the same minibatch. If the score improved, proceed. If not, discard and loop back.
6. Full Evaluation
Evaluate the new candidate on all of Dpareto. Add it to the pool P with ancestry records.
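The six steps can be put together in a compact toy loop. This is a sketch under stated assumptions, not the authors' implementation: `gepa_sketch`, `toy_evaluate`, and `toy_reflect` are hypothetical names, the "system" is a keyword check instead of an LLM pipeline, the reflection step is a string edit instead of an LLM call, and merge and ancestry records are left out.

```python
import random
from typing import Callable, Dict, List, Tuple

Candidate = Dict[str, str]                                 # module name -> prompt
Evaluate = Callable[[Candidate, str], Tuple[float, str]]   # -> (score, text feedback)
Reflect = Callable[[str, List[str]], str]                  # (prompt, feedback) -> new prompt

def gepa_sketch(seed: Candidate, d_feedback: List[str], d_pareto: List[str],
                evaluate: Evaluate, reflect: Reflect, budget: int,
                minibatch_size: int = 2, rng_seed: int = 0) -> Candidate:
    rng = random.Random(rng_seed)
    modules = list(seed)
    pool = [seed]
    pareto_scores = [[evaluate(seed, x)[0] for x in d_pareto]]
    budget -= len(d_pareto)
    step = 0
    # Only start an iteration if its worst-case rollout cost still fits the budget.
    while budget >= 2 * minibatch_size + len(d_pareto):
        # 1. Select candidate: weight by number of D_pareto examples it leads on.
        leads = [0] * len(pool)
        for i in range(len(d_pareto)):
            best = max(s[i] for s in pareto_scores)
            for k, s in enumerate(pareto_scores):
                leads[k] += int(s[i] == best)
        parent = pool[rng.choices(range(len(pool)), weights=leads, k=1)[0]]
        # 2. Select module: round-robin.
        module = modules[step % len(modules)]
        step += 1
        # 3. Execute on a minibatch and gather scores plus text feedback.
        batch = rng.sample(d_feedback, minibatch_size)
        results = [evaluate(parent, x) for x in batch]
        budget -= minibatch_size
        # 4. Reflective mutation of the selected module's prompt.
        child = dict(parent)
        child[module] = reflect(parent[module], [text for _, text in results])
        # 5. Minibatch check: discard unless the child scores strictly higher.
        child_total = sum(evaluate(child, x)[0] for x in batch)
        budget -= minibatch_size
        if child_total <= sum(score for score, _ in results):
            continue
        # 6. Full evaluation on D_pareto, then add to the pool.
        pool.append(child)
        pareto_scores.append([evaluate(child, x)[0] for x in d_pareto])
        budget -= len(d_pareto)
    # Termination: return the candidate with the best aggregate D_pareto score.
    best_k = max(range(len(pool)), key=lambda k: sum(pareto_scores[k]))
    return pool[best_k]

# Toy task: an example is "solved" when its keyword appears somewhere in the prompts.
def toy_evaluate(candidate: Candidate, keyword: str) -> Tuple[float, str]:
    if keyword in " ".join(candidate.values()):
        return 1.0, "ok"
    return 0.0, f"missing: {keyword}"

def toy_reflect(prompt: str, feedback: List[str]) -> str:
    missing = [t.split(": ", 1)[1] for t in feedback if t.startswith("missing: ")]
    return prompt + "".join(f" Mention {m}." for m in missing)

examples = ["dates", "names", "units"]  # reused for both splits to keep the toy small
seed = {"query": "Write a query.", "answer": "Write an answer."}
best = gepa_sketch(seed, examples, examples, toy_evaluate, toy_reflect,
                   budget=40, minibatch_size=3)
print(best["query"])
```

With the minibatch covering all three examples, the first mutation already adds every missing keyword to the query prompt; later mutations find nothing to fix, fail the minibatch check, and are discarded until the budget runs out.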

System-Aware Merge (Crossover)

In addition to mutation, GEPA supports a crossover operation called System-Aware Merge. When two candidates have evolved different modules independently (one improved the query module, another improved the synthesis module), Merge combines them by taking the best version of each module from each lineage. This is unique to compound AI systems — you can't do this with single-prompt optimization.
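One plausible reading of Merge, sketched below, is a per-module choice relative to a common ancestor: take whichever lineage changed the module, and fall back to the higher-scoring lineage when both (or neither) changed it. The function name, the tie-breaking rule, and the use of a single aggregate score per lineage are assumptions for illustration; the paper's exact procedure may differ.

```python
from typing import Dict

Candidate = Dict[str, str]  # module name -> prompt

def system_aware_merge(ancestor: Candidate, a: Candidate, b: Candidate,
                       score_a: float, score_b: float) -> Candidate:
    """Combine two descendants of `ancestor` module by module (hypothetical rule)."""
    merged: Candidate = {}
    for module, original in ancestor.items():
        a_changed = a[module] != original
        b_changed = b[module] != original
        if a_changed and not b_changed:
            merged[module] = a[module]
        elif b_changed and not a_changed:
            merged[module] = b[module]
        else:
            # Both or neither changed: defer to the lineage with the better score.
            merged[module] = a[module] if score_a >= score_b else b[module]
    return merged

ancestor = {"query": "q0", "answer": "a0"}
lineage_a = {"query": "q1 (improved query prompt)", "answer": "a0"}
lineage_b = {"query": "q0", "answer": "a1 (improved synthesis prompt)"}
merged = system_aware_merge(ancestor, lineage_a, lineage_b, score_a=0.6, score_b=0.7)
print(merged)
```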

The complete data flow: System Φ(x) → trajectory τ (reasoning + tool calls + outputs) → reflection (natural language diagnosis) → mutation (prompt edit) → evaluation (metric μ) → Pareto update. Each iteration uses a small number of rollouts (minibatch + validation), and the accumulated lessons flow through the genetic tree via ancestry records.

Termination

When the budget is exhausted, GEPA returns the candidate with the best aggregate score on Dpareto.

Budget allocation: The majority of GEPA's rollout budget goes to validation (scoring candidates on Dpareto for Pareto selection), not to generating learning signals. The actual training rollouts (used for reflection) can be as few as 6–179 across four tasks. If we count only training rollouts, GEPA matches GRPO's best validation with 102, 32, 6, and 179 rollouts respectively — up to 78x more efficient.
What happens when a reflective mutation does NOT improve the minibatch score?

Chapter 6: Results vs GRPO

The headline result: across six benchmarks on Qwen3 8B, GEPA outperforms GRPO (24K rollouts) by +6% average and up to +20%, while using up to 35x fewer rollouts.

Task-by-Task on Qwen3 8B

| Task | Baseline | GRPO (24K) | GEPA | GEPA Rollouts |
|---|---|---|---|---|
| HotpotQA | 42.3% | 43.3% | 62.3% | 6,871 |
| IFBench | 36.9% | 35.9% | 38.6% | 3,593 |
| HoVer | 35.3% | 38.7% | 52.3% | 7,051 |
| PUPA | 80.8% | 86.7% | 91.9% | 2,426 |
| AIME-2025 | 27.3% | 38.0% | 32.0% | 1,839 |
| LiveBench-Math | 48.7% | 51.3% | 52.0% | 1,839 |

The HotpotQA result is striking: GRPO barely moves the needle (+1 point over baseline with 24K rollouts), while GEPA gains 20 points over baseline (19 over GRPO) with 6,871 rollouts. The multi-hop QA system has multiple modules and complex interactions, exactly the setting where reflective diagnosis of traces shines. GRPO's scalar reward cannot tell the system which hop went wrong.
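The aggregate figures quoted elsewhere in this article can be recomputed from the per-task table above. The sketch assumes the aggregate is an unweighted mean over the six tasks, which reproduces the published aggregates to within rounding of the table's one-decimal values.

```python
from statistics import mean

# (baseline, GRPO, GEPA) accuracy in percent, copied from the table above.
results = {
    "HotpotQA": (42.3, 43.3, 62.3),
    "IFBench": (36.9, 35.9, 38.6),
    "HoVer": (35.3, 38.7, 52.3),
    "PUPA": (80.8, 86.7, 91.9),
    "AIME-2025": (27.3, 38.0, 32.0),
    "LiveBench-Math": (48.7, 51.3, 52.0),
}

baseline_avg = mean(r[0] for r in results.values())
grpo_avg = mean(r[1] for r in results.values())
gepa_avg = mean(r[2] for r in results.values())
gap_vs_grpo = gepa_avg - grpo_avg

# Task with the largest GEPA-minus-GRPO margin, and tasks where GRPO wins.
best_task = max(results, key=lambda t: results[t][2] - results[t][1])
grpo_wins = [t for t, r in results.items() if r[1] > r[2]]

print(f"baseline {baseline_avg:.2f}, GRPO {grpo_avg:.2f}, GEPA {gepa_avg:.2f}")
print(f"GEPA - GRPO = {gap_vs_grpo:.2f} points; largest margin on {best_task}")
print(f"GRPO ahead on: {grpo_wins}")
```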

AIME-2025 is the one exception where GRPO outperforms GEPA. This is a math reasoning task where the primary bottleneck is the LLM's mathematical ability, not its instruction-following. Weight updates (which GRPO does) can teach new mathematical reasoning patterns; prompt changes alone cannot. This is an honest limitation of prompt optimization.

Sample Efficiency Deep Dive

To match GRPO's best validation scores, GEPA needs only:

If we count only the training rollouts (not validation), the numbers are even more dramatic: 102, 32, 6, and 179 rollouts.

Learning Curves: GEPA vs GRPO vs MIPROv2

[Interactive figure: per-task learning curves showing performance against rollout budget. GEPA (orange) reaches high performance early; GRPO (blue) climbs slowly over 24K rollouts.]

On which task does GRPO outperform GEPA, and why?

Chapter 7: Results vs MIPROv2

MIPROv2 is the prior state-of-the-art prompt optimizer (from the DSPy ecosystem). It optimizes both instructions and few-shot demonstrations jointly. GEPA outperforms it on every benchmark and every model.

Qwen3 8B Comparison

| Method | Aggregate Score | Improvement over Baseline |
|---|---|---|
| Baseline | 45.2% | |
| MIPROv2 | 47.8% | +2.6% |
| GRPO (24K) | 48.9% | +3.7% |
| GEPA | 54.8% | +9.6% |

GPT-4.1 Mini Comparison

| Method | Aggregate Score | Improvement |
|---|---|---|
| Baseline | 53.0% | |
| Trace (OptoPrime) | 56.3% | +3.3% |
| MIPROv2 | 58.7% | +5.6% |
| TextGrad | 59.1% | +6.1% |
| GEPA | 65.2% | +12.2% |
| GEPA+Merge | 66.4% | +13.3% |

AIME-2025 spotlight: On AIME-2025 with GPT-4.1 Mini, GEPA achieves 59.3% vs MIPROv2's 51.3%, a +8 point gap. On the same benchmark with Qwen3 8B, GEPA gets 32.0% vs MIPROv2's 20.0%, a +12 point gap. Prompt optimization alone pushes a small open-source model to solve nearly a third of competition-level math problems.

Why GEPA Beats MIPROv2

Two key differences explain the gap:

Cross-model generalization: Here is a remarkable finding. GEPA-optimized prompts for Qwen3-8B, when evaluated on GPT-4.1 Mini without modification, achieve +9% aggregate improvement. This outperforms MIPROv2 (+5.6%), TextGrad (+6.1%), and Trace (+3.3%) — all of which were optimized directly on GPT-4.1 Mini. The prompts learned from a weaker model transfer to a stronger one because they encode strategies, not model-specific tricks.
Why are GEPA's prompts more efficient than MIPROv2's despite achieving higher performance?

Chapter 8: What Optimized Prompts Look Like

One of the most compelling aspects of GEPA is that its outputs are human-readable. Unlike RL-optimized weights (which are inscrutable floating-point tensors), GEPA's optimized prompts are natural language instructions that you can read, understand, and even learn from.

Example: Multi-Hop QA (HotpotQA)

The seed prompt for the second-hop query module is minimal:

Seed prompt: "Given the fields question, summary_1, produce the fields query."

After GEPA's optimization, this becomes a rich, multi-section instruction:

GEPA's optimized prompt (excerpt): "You will be given two input fields: question and summary_1. Your task: Generate a new search query optimized for the second hop of a multi-hop retrieval system. The original user question is typically complex and requires information from multiple documents. Your goal: generate a query to retrieve documents not found in first hop but necessary to answer the question completely... Identify entities or topics mentioned in summary_1 that are related but different from first-hop documents. Reframe the query to explicitly mention these broader or related entities... Ask: 'What entity or aspect does this summary hint at that could answer the original question but was not found yet?'"

What GEPA Learned

Look at what the optimization discovered: the seed prompt gives zero guidance. GEPA's prompt contains specific strategies derived from observing failures:

Evolution Trajectory: PUPA (Privacy Task)

Figure 5 in the paper shows the optimization tree for the PUPA privacy delegation task. Each node is a candidate with its score. The progression from seed (82.3%) to best (97.6%) accumulates specific lessons:

  1. Candidate 0 (82.3%): Base instruction — "create a privacy-preserving request"
  2. Candidate 3 (87.8%): Adds detailed PII identification and generalization strategies
  3. Candidate 4 (94.4%): Formalizes output into Reasoning + Request sections. Bans names/codes explicitly.
  4. Candidate 5 (94.7%): Adds transparent transformation rationale. Handles edge cases (fictional characters, professional scenarios).
  5. Candidate 11 (97.6%): Rigorous stepwise protocol. Zero-leakage tolerance. Auditable privacy guarantees.

Before vs After: Prompt Comparison

[Interactive demo: toggles between the seed prompt and GEPA's optimized version for the multi-hop QA second-hop query module. The optimized version adds concrete strategies learned from failure analysis.]

Prompts as knowledge artifacts: GEPA's optimized prompts are not just instructions — they are distilled knowledge. Each sentence encodes a lesson learned from specific failures. A human reading the optimized prompt would learn the same strategies that took GEPA hundreds of rollouts to discover. This is a unique advantage over weight-space optimization: the learned knowledge is interpretable.
What kind of knowledge does GEPA embed in its optimized prompts?

Chapter 9: Connections

GEPA sits at the intersection of several active research threads. Let's map where it fits.

Relation to DSPy / MIPROv2

GEPA builds on the DSPy framework's formalism of compound AI systems (Φ = (M, C, X, Y)). MIPROv2 is DSPy's prompt optimizer, focusing on joint instruction + few-shot optimization. GEPA outperforms MIPROv2 by replacing proposal-based optimization with reflection-based optimization and by using Pareto selection instead of Bayesian optimization.

Relation to TextGrad / Trace

TextGrad and Trace (with OptoPrime) are text-gradient methods that also use language feedback for optimization. But they use a greedy "select best candidate" strategy, which GEPA's Pareto ablation shows leads to +6.05% vs GEPA's +12.44%. GEPA also captures richer feedback via execution traces, not just output-level gradients.

Relation to GRPO / RLVR

GRPO and GEPA solve the same optimization problem (Equation 1) but in different parameter spaces. GRPO updates weights Θ via policy gradients from scalar rewards. GEPA updates prompts Π via natural language reflection. The key tradeoff: GRPO can teach genuinely new capabilities (as seen on AIME), but at 24K+ rollouts. GEPA is up to 35x more sample-efficient for tasks where the LLM already has the capability but needs better instructions.

Relation to OPRO / Meta-Harness

OPRO (Yang et al., 2023) was an early prompt optimizer using LLM-based optimization. GEPA advances this by adding structured execution trace reflection and Pareto-based candidate selection. Meta-Harness is related work on optimizing prompts for AI systems.

Relation to AlphaEvolve

AlphaEvolve (Google DeepMind) also uses LLM-based evolutionary search for code optimization. GEPA shares the evolutionary framework but focuses on prompt optimization for compound AI systems, with the unique addition of reflective mutation from execution traces.

GEPA as Inference-Time Search

Beyond task adaptation, GEPA can be repurposed for inference-time code optimization. On NPUEval (AMD kernel generation), GEPA boosts GPT-4o from 4.25% to 30.52% vector utilization. On KernelBench (CUDA), it pushes near-0% fast1 score above 20%. The key: domain documentation is injected into the feedback function μf, surfacing relevant manual sections based on compiler errors.

Cheat Sheet

| Aspect | GEPA |
|---|---|
| What it optimizes | Prompts ΠΦ (weights frozen) |
| Key mechanism | Reflective mutation from execution traces |
| Selection strategy | Pareto frontier (multi-objective) |
| Crossover | System-aware merge (combine best modules) |
| Works with | Any LLM (open or proprietary) |
| vs GRPO | +6% avg, up to +20%, up to 35x fewer rollouts |
| vs MIPROv2 | +10% aggregate, prompts 9.2x shorter |
| Best result | +12% on AIME-2025 vs MIPROv2 (Qwen3 8B) |
| Cross-model transfer | Qwen3 prompts → +9% on GPT-4.1 Mini |
| Min training rollouts | 6 rollouts to match GRPO on one task |
| Models tested | Qwen3-8B, GPT-4.1 Mini, GPT-4o |
| Tasks | HotpotQA, AIME, LiveBench, IFBench, PUPA, HoVer |

The broader lesson: When your system produces rich natural-language traces, don't throw them away by collapsing to scalar rewards. Read them. Diagnose them. Learn from them. GEPA shows that this simple principle, reflection in language, can outperform sophisticated RL with a fraction of the data.
When should you use GRPO over GEPA?