RL needs 24,000 rollouts to adapt an LLM. What if you could read the execution trace in natural language, diagnose what went wrong, and fix the prompt directly? GEPA does exactly that, and it outperforms GRPO while using up to 35x fewer rollouts.
You have a multi-hop question-answering system. It uses an LLM to generate a search query, retrieves documents, then uses the LLM again to synthesize an answer. The system works, but it only gets 42% accuracy on HotpotQA. You want to improve it.
The standard approach in 2025–2026 is reinforcement learning, specifically GRPO (Group Relative Policy Optimization). You run the system thousands of times, observe which runs succeed and which fail, and use those scalar rewards to update the model weights via policy gradients.
The cost is brutal. GRPO needs 24,000 rollouts to learn a task, and each rollout means running your full AI system end-to-end: LLM calls, tool invocations, retrieval, the works. At $0.05 in API calls per rollout, that is 24,000 × $0.05 = $1,200 per task. And if you are using a proprietary model like GPT-4.1, you cannot update the weights at all.
What if there were a way to learn from those traces directly? To read what the system did, understand why it failed, and fix the prompt accordingly — all in natural language, with no weight updates needed?
That is the question GEPA answers.
[Interactive chart: score versus rollout budget for GEPA and GRPO. GEPA reaches strong performance with a fraction of GRPO's budget; the dashed line marks GRPO's final score after 24K rollouts.]
Here is the observation that makes GEPA possible: LLM execution traces are natural language.
When a compound AI system runs, it produces a stream of text: the system prompt, the LLM's chain-of-thought reasoning, the tool calls it makes, the tool outputs it receives, the intermediate results it passes between modules, and — crucially — the evaluation feedback (compiler errors, failed rubrics, mismatched answers).
All of this is text. And modern LLMs are extraordinarily good at reading text, diagnosing problems in it, and proposing fixes.
This is what the authors call reflective mutation — using an LLM to reflect on execution traces and mutate prompts based on that reflection. It is the core mechanism of GEPA, and it is why GEPA can learn from as few as 6 training rollouts on some tasks.
Before we can optimize prompts, we need a precise definition of what we are optimizing. GEPA targets compound AI systems — not single LLM calls, but multi-module pipelines with control flow, tools, and multiple prompts.
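A compound AI system is a tuple $\Phi = (M, C, X, Y)$, where $M = \langle M_1, \dots, M_{|M|} \rangle$ is the set of language modules, $C$ is the control flow that orchestrates them, and $X$ and $Y$ are the system's global input and output schemas. Each module $M_i$ has its own prompt $\pi_i$ and model weights $\theta_i$.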
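Let $\Pi_\Phi = \langle \pi_1, \dots, \pi_{|M|} \rangle$ be the collection of all prompts and $\Theta_\Phi = \langle \theta_1, \dots, \theta_{|M|} \rangle$ the model weights. For a task instance $(x, m)$ with input $x$ and evaluator metadata $m$ (gold answers, rubrics), the system produces output $y = \Phi(x; \langle \Pi, \Theta \rangle_\Phi)$. A metric $\mu : Y \times \mathcal{M} \to [0, 1]$ scores the output quality, where $\mathcal{M}$ is the space of evaluator metadata (not to be confused with the module set $M$).
GEPA focuses on optimizing only the prompts $\Pi_\Phi$, keeping the weights $\Theta_\Phi$ frozen. This is what makes it work with any LLM, including proprietary APIs where weight updates are impossible.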
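To make the formalism concrete, here is a minimal Python sketch of a compound system as a data structure. The class names (`Module`, `CompoundSystem`), the toy `two_hop` control flow, and the `exact_match` metric are all illustrative assumptions, not part of GEPA or DSPy. No LLM is called; the point is only to show which field GEPA would edit (the prompt) and which stays frozen (the model).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Module:
    name: str
    prompt: str                # pi_i: the only part a prompt optimizer edits
    model: str = "frozen-llm"  # stands in for theta_i; never updated here


@dataclass
class CompoundSystem:
    modules: List[Module]
    # Control flow C: takes the modules and an input x, returns the output y.
    control_flow: Callable[[List[Module], str], str]

    def prompts(self) -> Dict[str, str]:
        """Collect all prompts (the collection called Pi in the text)."""
        return {m.name: m.prompt for m in self.modules}

    def run(self, x: str) -> str:
        return self.control_flow(self.modules, x)


def exact_match(y: str, metadata: Dict[str, str]) -> float:
    """Toy metric mu: a score in [0, 1] from the output and evaluator metadata."""
    return 1.0 if y.strip().lower() == metadata["gold"].strip().lower() else 0.0


def two_hop(modules: List[Module], x: str) -> str:
    # Toy control flow: each "module" just tags the text instead of calling an LLM.
    query = f"[{modules[0].name}] {x}"
    return f"[{modules[1].name}] {query}"


system = CompoundSystem(
    modules=[
        Module("query", "Write a search query."),
        Module("answer", "Answer from the passages."),
    ],
    control_flow=two_hop,
)
```

In this sketch, optimizing prompts means producing new `CompoundSystem` values that differ only in their `prompt` strings.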
Rollouts are expensive. The optimizer is limited to at most B rollouts. The real challenge: extract maximal learning signal from every single rollout.
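Putting the definitions together, the problem can be written as a budget-constrained maximization over prompts. The form below is reconstructed from the notation in this section; $\mathcal{T}$ is an assumed symbol for the task distribution, and the paper's own statement may be notated differently.

```latex
\Pi^{*}_{\Phi} \;=\; \arg\max_{\Pi_\Phi}\;
  \mathbb{E}_{(x,\,m)\sim\mathcal{T}}
  \Big[\, \mu\big(\Phi(x;\ \langle \Pi, \Theta \rangle_\Phi),\ m\big) \Big]
\quad \text{subject to} \quad \#\text{rollouts} \le B
```

The weights $\Theta_\Phi$ appear inside the system call but not under the $\arg\max$: they are held fixed while only the prompts vary.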
This is the heart of GEPA. Reflective mutation is the process by which GEPA reads an execution trace, diagnoses what went wrong, and proposes a specific prompt change. Let's walk through it step by step.
GEPA runs the current candidate system on a minibatch of training examples. It captures the full execution trace: every module's input, the LLM's reasoning, tool calls, tool outputs, and final answer. It also calls the feedback function $\mu_f$, which returns both a numeric score and text feedback (e.g., compiler errors, which rubrics failed, the gold answer for comparison).
GEPA selects which module to update via a round-robin policy. In a 3-module system, it cycles through M1, M2, M3, M1, ... This ensures all modules get optimized, not just the one that happens to fail most visibly.
A reflection LLM is shown the current prompt, the execution trace for the selected module, the score, and the text feedback. Its task: diagnose what went wrong and propose specific changes to the prompt.
Based on the reflection, the LLM proposes an updated prompt. This is not a random perturbation — it is a targeted edit informed by concrete failure analysis. The new prompt inherits all the lessons from its parent and adds the new insight.
The updated system is re-run on the same minibatch. If the score improves, the new candidate is added to the pool. If not, it is discarded. No wasted budget on bad ideas.
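The five steps above can be sketched as a single accept-or-reject function. This is a simplified illustration, not GEPA's actual implementation: `run_with_feedback` and `reflect_and_rewrite` are hypothetical callables standing in for a real rollout and a real reflection LLM, and the toy versions below exist only so the sketch runs end to end.

```python
from typing import Callable, Dict, List, Tuple

Prompts = Dict[str, str]     # module name -> prompt text
Rollout = Tuple[float, str]  # (score, trace plus text feedback)


def reflective_mutation_step(
    prompts: Prompts,
    minibatch: List[dict],
    step: int,
    run_with_feedback: Callable[[Prompts, dict], Rollout],
    reflect_and_rewrite: Callable[[str, List[Rollout]], str],
):
    """One mutation attempt. Returns (prompts, accepted, rollouts_used)."""
    # Step 1: run the current candidate and capture scores plus text feedback.
    parent_runs = [run_with_feedback(prompts, ex) for ex in minibatch]
    # Step 2: round-robin module selection (dicts keep insertion order).
    names = list(prompts)
    target = names[step % len(names)]
    # Steps 3-4: the reflection LLM proposes a new prompt for that module only.
    child = dict(prompts)
    child[target] = reflect_and_rewrite(prompts[target], parent_runs)
    # Step 5: re-run on the same minibatch; keep the child only if it scores higher.
    child_runs = [run_with_feedback(child, ex) for ex in minibatch]
    parent_total = sum(score for score, _ in parent_runs)
    child_total = sum(score for score, _ in child_runs)
    accepted = child_total > parent_total
    return (child if accepted else prompts), accepted, 2 * len(minibatch)


def toy_run(prompts: Prompts, example: dict) -> Rollout:
    # Stand-in rollout: succeeds only if the query prompt mentions "entity".
    ok = "entity" in prompts["query"]
    feedback = "correct" if ok else "query repeated the question; name the missing entity"
    return (1.0 if ok else 0.0), feedback


def toy_reflect(prompt: str, runs: List[Rollout]) -> str:
    # Stand-in reflection LLM: appends a rule suggested by the feedback text.
    return prompt + " Always name the missing entity."


seed = {"query": "Write a search query.", "answer": "Answer the question."}
batch = [{"q": "a"}, {"q": "b"}]
new_prompts, accepted, used = reflective_mutation_step(
    seed, batch, step=0, run_with_feedback=toy_run, reflect_and_rewrite=toy_reflect
)
```

With `step=0` the query module is selected, the edit fixes the toy failure, and the child is accepted. With `step=1` the answer module is edited instead, the score does not move, and the parent is returned unchanged. Each attempt costs two passes over the minibatch in this sketch.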
Reflective mutation is powerful, but it has a failure mode: local optima. If you always evolve the best-performing candidate, you get trapped. You find one good strategy, keep refining it, and exhaust your budget without discovering a fundamentally different (and better) approach.
Figure 6 in the paper shows this vividly. With a greedy "always pick the best" strategy (like TextGrad uses), the optimization tree degenerates into a long chain — one child after another, each a minor tweak of the same idea. After the first improvement, the search stalls and burns rollouts trying to squeeze out more from a single strategy.
GEPA's solution is elegant: instead of tracking one "best" candidate, it maintains a Pareto frontier across all training instances.
Here is how it works. For each training example i, GEPA records the score of every candidate. A candidate is on the Pareto frontier if it achieves the best score on at least one training example. Candidates that are strictly dominated (beaten by another candidate on every example) are pruned.
A candidate that solves a specific type of problem well might encode a strategy that the globally best candidate lacks. By keeping it in the pool and sometimes evolving from it, GEPA explores diverse regions of prompt space. The search becomes a genuinely branching tree, not a chain.
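The per-example bookkeeping described above can be sketched in a few lines. The function name and the candidate names are illustrative. The pruning rule used here (drop a candidate whose set of led examples is a strict subset of another's) is one assumed reading of "dominated"; the paper's exact rule may differ.

```python
from typing import Dict, List, Set


def pareto_frontier(scores: Dict[str, List[float]]) -> Dict[str, Set[int]]:
    """Map each surviving candidate to the set of example indices it leads on."""
    n_examples = len(next(iter(scores.values())))
    wins: Dict[str, Set[int]] = {name: set() for name in scores}
    for i in range(n_examples):
        best = max(s[i] for s in scores.values())
        for name, s in scores.items():
            if s[i] == best:  # ties: every top scorer counts as leading
                wins[name].add(i)
    # Keep only candidates that lead on at least one example.
    frontier = {name: w for name, w in wins.items() if w}
    # Assumed pruning rule: drop candidates whose wins are a strict subset of another's.
    return {
        name: w
        for name, w in frontier.items()
        if not any(w < other for o, other in frontier.items() if o != name)
    }


scores = {
    "seed":       [0.2, 0.5, 0.1, 0.4],
    "generalist": [0.6, 0.7, 0.3, 0.6],
    "specialist": [0.1, 0.2, 0.9, 0.1],
}
frontier = pareto_frontier(scores)
# frontier == {"generalist": {0, 1, 3}, "specialist": {2}}
```

Note that `specialist` has a much lower average score than `generalist` but stays on the frontier because it leads on example 2. A greedy "best average" rule would have discarded it.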
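When GEPA needs to select a candidate to mutate, it samples from the Pareto frontier, weighting each candidate by the number of training examples on which it leads.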
Candidates that lead on many examples get sampled more often (exploitation), but candidates that lead on even just one example still get a chance (exploration). This naturally balances the exploration–exploitation tradeoff.
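A minimal sketch of this weighted sampling, assuming each frontier candidate is already mapped to the set of examples it leads on. The function names are illustrative, and `random.choices` from the standard library does the weighted draw.

```python
import random
from collections import Counter
from typing import Dict, Set


def selection_probabilities(wins: Dict[str, Set[int]]) -> Dict[str, float]:
    """Probability of picking each candidate, proportional to examples led."""
    total = sum(len(w) for w in wins.values())
    return {name: len(w) / total for name, w in wins.items()}


def sample_parent(wins: Dict[str, Set[int]], rng: random.Random) -> str:
    names = list(wins)
    weights = [len(wins[name]) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


wins = {"generalist": {0, 1, 3}, "specialist": {2}}
probs = selection_probabilities(wins)  # generalist: 0.75, specialist: 0.25
rng = random.Random(0)
draws = Counter(sample_parent(wins, rng) for _ in range(1000))
```

The candidate that leads on three of four examples is drawn about three times as often, yet the specialist is never starved of mutation attempts.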
Now let's assemble the pieces. GEPA's optimization loop combines reflective mutation with Pareto-based candidate selection in a genetic framework.
GEPA starts with a single candidate: the base system $\Phi$ with its seed prompts. The training data $D_{\text{train}}$ is split into $D_{\text{feedback}}$ (used for rollouts and reflection) and $D_{\text{pareto}}$ (used for scoring candidates on the Pareto frontier). The candidate pool $P$ contains only the base system initially.
In addition to mutation, GEPA supports a crossover operation called System-Aware Merge. When two candidates have evolved different modules independently (one improved the query module, another improved the synthesis module), Merge combines them by taking the best version of each module from each lineage. This is unique to compound AI systems — you can't do this with single-prompt optimization.
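As a rough illustration of the merge idea, the sketch below combines two candidates module by module, using their common ancestor to see which lineage actually changed each prompt. This "take the changed one" rule is a simplifying assumption; the text describes taking the best version of each module, which would also need scores and a tie-break when both lineages edited the same module.

```python
from typing import Dict

Prompts = Dict[str, str]  # module name -> prompt text


def system_aware_merge(ancestor: Prompts, a: Prompts, b: Prompts) -> Prompts:
    """Per module, prefer the lineage that changed the prompt relative to the ancestor."""
    merged: Prompts = {}
    for name, base in ancestor.items():
        a_changed = a[name] != base
        b_changed = b[name] != base
        if b_changed and not a_changed:
            merged[name] = b[name]
        else:
            # Unchanged in both, changed only in `a`, or changed in both.
            # This sketch defaults to `a`; a real system would need a tie-break
            # such as the higher-scoring candidate.
            merged[name] = a[name]
    return merged


ancestor = {"query": "Write a search query.", "answer": "Answer the question."}
improved_query = {"query": "Write a query naming the missing entity.",
                  "answer": "Answer the question."}
improved_answer = {"query": "Write a search query.",
                   "answer": "Answer concisely and cite the passage."}
merged = system_aware_merge(ancestor, improved_query, improved_answer)
```

Here `merged` carries the evolved query prompt from one lineage and the evolved answer prompt from the other, which neither parent has on its own.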
When the budget is exhausted, GEPA returns the candidate with the best aggregate score on $D_{\text{pareto}}$.
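The whole loop can be sketched compactly as follows. This is a toy under stated assumptions, not the paper's algorithm verbatim: `score` and `mutate` are hypothetical stand-ins for a real rollout and for LLM reflection, merge is omitted, and the budget accounting (two minibatch passes per attempt plus one Pareto-set pass per accepted child) is a simplification.

```python
import random
from typing import Dict, List

Prompts = Dict[str, str]


def gepa_loop(seed, d_feedback, d_pareto, score, mutate, budget, batch_size=2, rng=None):
    rng = rng or random.Random(0)
    pool: List[Prompts] = [seed]
    # table[k][i] = score of candidate k on Pareto example i
    table: List[List[float]] = [[score(seed, ex) for ex in d_pareto]]
    used, step, n = len(d_pareto), 0, len(d_pareto)
    while used + 2 * batch_size <= budget:
        # Pareto-based selection: weight = number of examples the candidate leads on.
        best = [max(row[i] for row in table) for i in range(n)]
        weights = [sum(row[i] == best[i] for i in range(n)) for row in table]
        parent = pool[rng.choices(range(len(pool)), weights=weights, k=1)[0]]
        # Reflective mutation on a minibatch, with round-robin module choice.
        batch = rng.sample(d_feedback, batch_size)
        module = list(parent)[step % len(parent)]
        child = mutate(parent, module, batch)
        before = sum(score(parent, ex) for ex in batch)
        after = sum(score(child, ex) for ex in batch)
        used += 2 * batch_size
        step += 1
        # Accept only strict improvements, and only if the Pareto pass is affordable.
        if after > before and used + n <= budget:
            pool.append(child)
            table.append([score(child, ex) for ex in d_pareto])
            used += n
    totals = [sum(row) for row in table]
    return pool[totals.index(max(totals))], used


def toy_score(prompts, ex):
    # An example counts as solved when the module it depends on contains its hint word.
    return 1.0 if ex["hint"] in prompts[ex["module"]] else 0.0


def toy_mutate(parent, module, batch):
    # Stand-in for reflection: copy hint words from failing batch examples into the prompt.
    child = dict(parent)
    for ex in batch:
        if ex["module"] == module and ex["hint"] not in child[module]:
            child[module] += " " + ex["hint"]
    return child


# Toy data: the same four hint types are reused for both splits.
examples = [
    {"module": "query", "hint": "entity"}, {"module": "query", "hint": "date"},
    {"module": "answer", "hint": "cite"}, {"module": "answer", "hint": "concise"},
]
seed = {"query": "Write a search query.", "answer": "Answer the question."}
best, used = gepa_loop(seed, examples, examples, toy_score, toy_mutate, budget=200)
```

The seed solves none of the toy examples, so any accepted child strictly improves the pool, and the returned candidate is the one with the highest total on the Pareto set.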
The headline result: across six benchmarks on Qwen3 8B, GEPA outperforms GRPO (24K rollouts) by about 6 percentage points on average and by up to roughly 20 points on a single task, while using up to 35x fewer rollouts.
| Task | Baseline | GRPO (24K) | GEPA | GEPA Rollouts |
|---|---|---|---|---|
| HotpotQA | 42.3% | 43.3% | 62.3% | 6,871 |
| IFBench | 36.9% | 35.9% | 38.6% | 3,593 |
| HoVer | 35.3% | 38.7% | 52.3% | 7,051 |
| PUPA | 80.8% | 86.7% | 91.9% | 2,426 |
| AIME-2025 | 27.3% | 38.0% | 32.0% | 1,839 |
| LiveBench-Math | 48.7% | 51.3% | 52.0% | 1,839 |
AIME-2025 is the one exception where GRPO outperforms GEPA. This is a math reasoning task where the primary bottleneck is the LLM's mathematical ability, not its instruction-following. Weight updates (which GRPO does) can teach new mathematical reasoning patterns; prompt changes alone cannot. This is an honest limitation of prompt optimization.
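As a quick sanity check on the table, the averages and per-task gaps can be recomputed directly from the listed numbers. The snippet uses only values from the table above; variable names are illustrative.

```python
# task: (baseline, GRPO 24K, GEPA, GEPA rollouts), copied from the table
rows = {
    "HotpotQA":       (42.3, 43.3, 62.3, 6871),
    "IFBench":        (36.9, 35.9, 38.6, 3593),
    "HoVer":          (35.3, 38.7, 52.3, 7051),
    "PUPA":           (80.8, 86.7, 91.9, 2426),
    "AIME-2025":      (27.3, 38.0, 32.0, 1839),
    "LiveBench-Math": (48.7, 51.3, 52.0, 1839),
}


def mean(column: int) -> float:
    return sum(r[column] for r in rows.values()) / len(rows)


baseline_avg, grpo_avg, gepa_avg = mean(0), mean(1), mean(2)
avg_gap = gepa_avg - grpo_avg  # about 5.87 points
largest_gap_task = max(rows, key=lambda t: rows[t][2] - rows[t][1])  # "HotpotQA"
grpo_wins = [t for t, r in rows.items() if r[1] > r[2]]  # ["AIME-2025"]
```

The average gap of roughly 5.9 points matches the "+6%" headline, the largest single-task gap in this table is HotpotQA at 19.0 points, and AIME-2025 is the only row where GRPO is ahead.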
To match GRPO's best validation scores, GEPA needs only:
If we count only the training rollouts (not validation), the numbers are even more dramatic: 102, 32, 6, and 179 rollouts.
[Interactive chart: score versus rollout budget for each task. GEPA (orange) reaches high performance early; GRPO (blue) climbs slowly over 24K rollouts.]
MIPROv2 is the prior state-of-the-art prompt optimizer (from the DSPy ecosystem). It optimizes both instructions and few-shot demonstrations jointly. GEPA outperforms it on every benchmark and every model.
| Method | Aggregate Score | Improvement over Baseline |
|---|---|---|
| Baseline | 45.2% | — |
| MIPROv2 | 47.8% | +2.6% |
| GRPO (24K) | 48.9% | +3.7% |
| GEPA | 54.8% | +9.6% |
| Method | Aggregate Score | Improvement |
|---|---|---|
| Baseline | 53.0% | — |
| Trace (OptoPrime) | 56.3% | +3.3% |
| MIPROv2 | 58.7% | +5.6% |
| TextGrad | 59.1% | +6.1% |
| GEPA | 65.2% | +12.2% |
| GEPA+Merge | 66.4% | +13.3% |
Two key differences explain the gap:
One of the most compelling aspects of GEPA is that its outputs are human-readable. Unlike RL-optimized weights (which are inscrutable floating-point tensors), GEPA's optimized prompts are natural language instructions that you can read, understand, and even learn from.
The seed prompt for the second-hop query module is minimal:
After GEPA's optimization, this becomes a rich, multi-section instruction:
Look at what the optimization discovered: the seed prompt gives zero guidance. GEPA's prompt contains specific strategies derived from observing failures:
Figure 5 in the paper shows the optimization tree for the PUPA privacy delegation task. Each node is a candidate with its score. The progression from seed (82.3%) to best (97.6%) accumulates specific lessons:
GEPA sits at the intersection of several active research threads. Let's map where it fits.
GEPA builds on the DSPy framework's formalism of compound AI systems (Φ = (M, C, X, Y)). MIPROv2 is DSPy's prompt optimizer, focusing on joint instruction + few-shot optimization. GEPA outperforms MIPROv2 by replacing proposal-based optimization with reflection-based optimization and by using Pareto selection instead of Bayesian optimization.
TextGrad and Trace (with OptoPrime) are text-gradient methods that also use language feedback for optimization. But they rely on a greedy "select best candidate" strategy; in GEPA's Pareto ablation, greedy selection yields +6.05% versus +12.44% for Pareto-based selection. GEPA also captures richer feedback via execution traces, not just output-level gradients.
GRPO and GEPA solve the same optimization problem (Equation 1) but in different parameter spaces. GRPO updates weights Θ via policy gradients from scalar rewards. GEPA updates prompts Π via natural language reflection. The key tradeoff: GRPO can teach genuinely new capabilities (as seen on AIME), but at 24K+ rollouts. GEPA is up to 35x more sample-efficient on tasks where the LLM already has the capability but needs better instructions.
OPRO (Yang et al., 2023) was an early prompt optimizer using LLM-based optimization. GEPA advances this by adding structured execution trace reflection and Pareto-based candidate selection. Meta-Harness is related work on optimizing prompts for AI systems.
AlphaEvolve (Google DeepMind) also uses LLM-based evolutionary search for code optimization. GEPA shares the evolutionary framework but focuses on prompt optimization for compound AI systems, with the unique addition of reflective mutation from execution traces.
Beyond task adaptation, GEPA can be repurposed for inference-time code optimization. On NPUEval (AMD kernel generation), GEPA boosts GPT-4o from 4.25% to 30.52% vector utilization. On KernelBench (CUDA), it lifts a near-0% $\text{fast}_1$ score above 20%. The key: domain documentation is injected into the feedback function $\mu_f$, which surfaces relevant manual sections based on compiler errors.
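One way such a documentation-aware feedback function might look is sketched below. Everything here is hypothetical: the `DOCS` lookup, the keyword matching, and the placeholder manual text are illustrative assumptions, not the setup used in the paper. A real system would retrieve sections from the actual vendor manual.

```python
from typing import Dict, Tuple

# Hypothetical documentation snippets keyed by an error keyword.
# The section text is placeholder content, not real manual material.
DOCS: Dict[str, str] = {
    "undeclared identifier": "(placeholder) Manual section on required header includes.",
    "alignment": "(placeholder) Manual section on buffer alignment rules.",
}


def feedback_fn(compiler_log: str, utilization: float) -> Tuple[float, str]:
    """Return (score, text feedback); doc sections are appended when an error matches."""
    log = compiler_log.strip()
    if log:
        notes = [doc for key, doc in DOCS.items() if key in log.lower()]
        text = "Compilation failed:\n" + log
        if notes:
            text += "\nRelevant documentation:\n" + "\n".join(notes)
        return 0.0, text
    # Clean compile: the score is the measured utilization.
    return utilization, f"Compiled. Vector utilization: {utilization:.2%}"


score, text = feedback_fn("error: use of undeclared identifier 'vmul'", 0.0)
```

The reflection LLM then sees the compiler error and the matching documentation side by side, so the next prompt revision can encode the relevant rule instead of guessing at it.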
| Aspect | GEPA |
|---|---|
| What it optimizes | Prompts $\Pi_\Phi$ (weights frozen) |
| Key mechanism | Reflective mutation from execution traces |
| Selection strategy | Pareto frontier (multi-objective) |
| Crossover | System-aware merge (combine best modules) |
| Works with | Any LLM (open or proprietary) |
| vs GRPO | +6% avg, up to +20%, up to 35x fewer rollouts |
| vs MIPROv2 | +10% aggregate, prompts 9.2x shorter |
| Best result | +12% on AIME-2025 (GPT-4.1 Mini) |
| Cross-model transfer | Qwen3 prompts → +9% on GPT-4.1 Mini |
| Min training rollouts | 6 rollouts to match GRPO on one task |
| Models tested | Qwen3-8B, GPT-4.1 Mini, GPT-4o |
| Tasks | HotpotQA, AIME, LiveBench, IFBench, PUPA, HoVer |