GEPA — Veanors

Chapter 0: The Problem

You have a multi-hop question-answering system. It uses an LLM to generate a search query, retrieves documents, then uses the LLM again to synthesize an answer. The system works, but it only gets 42% accuracy on HotpotQA. You want to improve it.

The standard approach in 2025–2026 is reinforcement learning. Specifically, GRPO (Group Relative Policy Optimization) — you run the system thousands of times, observe which runs succeed and which fail, and use those scalar rewards to update the model weights via policy gradients.

There is a brutal cost. In the experiments discussed below, GRPO is run for 24,000 rollouts per task. Each rollout means running your full AI system end-to-end: LLM calls, tool invocations, retrieval, the works. If each rollout costs $0.05 in API calls, that is $1,200 per task. And if you are using a proprietary model like GPT-4.1, you cannot update the weights at all.
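The cost figure above is simple multiplication, sketched here so you can plug in your own numbers. The $0.05 price is the illustrative value from the paragraph, not a measured one, and the function name is made up for this example.

```python
# Figures from the paragraph above; the per-rollout price is illustrative.
ROLLOUTS = 24_000
COST_PER_ROLLOUT_USD = 0.05

def rollout_cost(rollouts: int, cost_per_rollout: float) -> float:
    """Total API spend for a given number of end-to-end rollouts."""
    return rollouts * cost_per_rollout

total = rollout_cost(ROLLOUTS, COST_PER_ROLLOUT_USD)
print(f"${total:,.0f} per task")
```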

The core tension: RL collapses rich execution traces — all the reasoning, the tool calls, the error messages, the intermediate outputs — into a single scalar: 0 (wrong) or 1 (right). That is like reading a student's 10-page exam, throwing away everything except the final grade, and trying to teach them from that alone. All the diagnostic information about what went wrong and why is lost.

What if there were a way to learn from those traces directly? To read what the system did, understand why it failed, and fix the prompt accordingly — all in natural language, with no weight updates needed?

That is the question GEPA answers.

Rollout Efficiency: GRPO vs GEPA

Interactive chart: score as a function of rollout budget for each method. GEPA reaches strong performance with a fraction of GRPO's budget. The dashed line shows GRPO's final score after 24K rollouts.

Why is GRPO sample-inefficient for adapting compound AI systems?

It collapses rich natural-language execution traces into scalar rewards, discarding all diagnostic information about what went wrong and why
It requires too much GPU memory to store the model weights
It can only optimize single-module systems, not multi-module pipelines

Chapter 1: The Key Insight

Here is the observation that makes GEPA possible: LLM execution traces are natural language.

When a compound AI system runs, it produces a stream of text: the system prompt, the LLM's chain-of-thought reasoning, the tool calls it makes, the tool outputs it receives, the intermediate results it passes between modules, and — crucially — the evaluation feedback (compiler errors, failed rubrics, mismatched answers).

All of this is text. And modern LLMs are extraordinarily good at reading text, diagnosing problems in it, and proposing fixes.

RL Approach (GRPO)

Run system → get output → compare to gold answer → reward = 0 or 1 → policy gradient update to model weights. The reasoning trace is discarded.

↓ vs ↓

GEPA Approach

Run system → capture full execution trace (reasoning + tool calls + errors) → LLM reads trace → diagnoses what went wrong → proposes specific prompt changes. The trace IS the learning signal.

Why this is more informative: A scalar reward tells you "the system failed." A trace tells you "the system failed because the second-hop query was too vague — it repeated the original question instead of targeting the missing entity mentioned in the first-hop summary." The second signal lets you make a targeted fix: "add an instruction to the query-generation prompt that says: identify entities in summary_1 that aren't covered by the first hop, and search for those."

This is what the authors call reflective mutation — using an LLM to reflect on execution traces and mutate prompts based on that reflection. It is the core mechanism of GEPA, and it is why GEPA can learn from as few as 6 training rollouts on some tasks.

The formal claim: Natural language traces, paired with structured feedback from the evaluation function, provide a much richer learning signal than scalar rewards. This lets GEPA be up to 35x more sample-efficient than GRPO. The key: the LLM performing the reflection has strong language priors — it already understands what good instructions look like, what common failure modes are, and how to fix them.

What makes execution traces a richer learning signal than scalar rewards?

They contain the full reasoning chain, tool calls, intermediate outputs, and evaluation feedback, letting an LLM diagnose exactly what went wrong and propose targeted prompt fixes
They are shorter and cheaper to process than reward signals
They allow gradient computation through the execution trace

Chapter 2: Compound AI Systems

Before we can optimize prompts, we need a precise definition of what we are optimizing. GEPA targets compound AI systems — not single LLM calls, but multi-module pipelines with control flow, tools, and multiple prompts.

Formal Definition

A compound AI system is a tuple Φ = (M, C, X, Y) where:

M = ⟨M₁, ..., M_|M|⟩ are the language modules. Each module M_i = (π_i, θ_i, X_i, Y_i) has a prompt π_i, model weights θ_i, and input/output schemas.
C is the control flow — the logic that orchestrates modules. It decides which module runs next, passes outputs between modules, invokes tools, and handles branching.
X, Y are the global input/output schemas.

Concrete example — multi-hop QA system: Module M₁ generates a first-hop search query from the question. The control flow C calls a retrieval tool, gets documents, summarizes them. Module M₂ generates a second-hop query using the question + first-hop summary. Control flow retrieves again. Module M₃ synthesizes the final answer from all retrieved information. Each module has its own prompt π_i — GEPA can optimize all of them.
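To make the tuple concrete, here is a minimal sketch of the multi-hop QA example as plain Python. The class and function names (`Module`, `multi_hop_qa`, `echo_llm`, `fake_retrieve`) are assumptions made for illustration, the LLM and retriever are stubs, and the retriever is assumed to return an already-summarized string. It is not how any particular framework represents these systems.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# An LLM is modelled as a function from (prompt, input text) to output text.
LLM = Callable[[str, str], str]

@dataclass
class Module:
    name: str
    prompt: str   # pi_i, the only part GEPA edits
    llm: LLM      # stands in for the frozen weights theta_i

    def __call__(self, text: str) -> str:
        return self.llm(self.prompt, text)

def multi_hop_qa(modules: Dict[str, Module],
                 retrieve: Callable[[str], str],
                 question: str) -> str:
    """Control flow C: two retrieval hops followed by answer synthesis."""
    query_1 = modules["hop1"](question)
    summary_1 = retrieve(query_1)
    query_2 = modules["hop2"](f"question: {question}\nsummary_1: {summary_1}")
    summary_2 = retrieve(query_2)
    return modules["answer"](
        f"question: {question}\ncontext: {summary_1} {summary_2}"
    )

# Stub LLM and retriever so the sketch runs without any API access.
def echo_llm(prompt: str, text: str) -> str:
    return f"[{prompt}] {text.splitlines()[0]}"

def fake_retrieve(query: str) -> str:
    return f"docs({query})"

modules = {
    "hop1": Module("hop1", "make query", echo_llm),
    "hop2": Module("hop2", "make second query", echo_llm),
    "answer": Module("answer", "answer", echo_llm),
}
answer = multi_hop_qa(modules, fake_retrieve, "Where was the author born?")
print(answer)
```

Because each prompt is an ordinary field on its module, an optimizer can swap one prompt and re-run the whole pipeline without touching the control flow.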

The Optimization Target

Let Π_Φ = ⟨π₁, ..., π_|M|⟩ be the collection of all prompts and Θ_Φ = ⟨θ₁, ..., θ_|M|⟩ the model weights. For a task instance (x, m) with input x and evaluator metadata m (gold answers, rubrics), the system produces output y = Φ(x; ⟨Π, Θ⟩_Φ). A metric μ : Y × M → [0,1] scores the output quality, where M in the metric's signature denotes the space of evaluator metadata rather than the module list.

⟨Π*, Θ*⟩_Φ = arg max_⟨Π,Θ⟩ E_(x,m)~T [ μ( Φ(x; ⟨Π, Θ⟩_Φ), m ) ]

GEPA focuses on optimizing only the prompts Π_Φ, keeping the weights Θ_Φ frozen. This is what makes it work with any LLM — including proprietary APIs where weight updates are impossible.

The Budget Constraint

Rollouts are expensive. The optimizer is limited to at most B rollouts. The real challenge: extract maximal learning signal from every single rollout.

⟨Π*, Θ*⟩_Φ = arg max_⟨Π,Θ⟩ E_(x,m)~T [ μ( Φ(x; ⟨Π, Θ⟩_Φ), m ) ], s.t. #rollouts ≤ B

Why compound systems make this harder (and more interesting): With a single-module system, there is one prompt to optimize. With a compound system, there are |M| prompts, and they interact. Improving the query-generation prompt might make the answer-synthesis prompt work worse if it was calibrated to the old query style. GEPA handles this by selecting which module to update (round-robin policy) and evaluating the whole system end-to-end after each change.

In the formal definition, what does GEPA optimize and what does it keep fixed?

It optimizes the prompt collection Π_Φ (all module prompts) while keeping the model weights Θ_Φ frozen: no gradient updates, and it works with any LLM including proprietary APIs
It optimizes both prompts and weights simultaneously using policy gradients
It optimizes only the control flow C while keeping prompts and weights fixed

Chapter 3: Reflective Mutation

This is the heart of GEPA. Reflective mutation is the process by which GEPA reads an execution trace, diagnoses what went wrong, and proposes a specific prompt change. Let's walk through it step by step.

Step 1: Execute and Trace

GEPA runs the current candidate system on a minibatch of training examples. It captures the full execution trace: every module's input, the LLM's reasoning, tool calls, tool outputs, and final answer. It also calls the feedback function μ_f, which returns both a numeric score and text feedback (e.g., compiler errors, which rubrics failed, the gold answer for comparison).
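A feedback function of this kind might look like the sketch below for a QA task. The name `exact_match_feedback` and the `Feedback` type are hypothetical; the point is only that the function returns text alongside the number instead of discarding it.

```python
from typing import NamedTuple

class Feedback(NamedTuple):
    score: float   # all that a scalar-reward method would see
    text: str      # extra diagnostic signal passed to the reflection LLM

def exact_match_feedback(predicted: str, gold: str) -> Feedback:
    """Hypothetical mu_f for a QA task: exact match plus a textual explanation."""
    if predicted.strip().lower() == gold.strip().lower():
        return Feedback(1.0, "Correct.")
    return Feedback(0.0, f"Expected answer: '{gold}', Got: '{predicted}'")

fb = exact_match_feedback("Funchal municipality", "Madeira archipelago")
print(fb.score, "|", fb.text)
```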

Step 2: Select a Module

GEPA selects which module to update via a round-robin policy. In a 3-module system, it cycles through M₁, M₂, M₃, M₁, ... This ensures all modules get optimized, not just the one that happens to fail most visibly.
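Round-robin selection is easy to picture with the standard library. This is a sketch of the cycling order only; the module names are placeholders.

```python
from itertools import cycle, islice

module_names = ["M1", "M2", "M3"]
round_robin = cycle(module_names)            # endless M1, M2, M3, M1, ...
first_seven = list(islice(round_robin, 7))   # the modules chosen in 7 iterations
print(first_seven)
```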

Step 3: Reflect

A reflection LLM is shown the current prompt, the execution trace for the selected module, the score, and the text feedback. Its task: diagnose what went wrong and propose specific changes to the prompt.

What the reflection LLM sees: (1) The current prompt for the module being optimized. (2) The full trajectory: what the module received as input, what it reasoned, what it outputted. (3) The final score and text feedback (e.g., "Expected answer: 'Madeira archipelago', Got: 'Funchal municipality' — the second-hop query was too narrow, searching for the parish instead of the broader region"). (4) The accumulated lessons from the candidate's ancestry — what was tried before and what worked.
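One way to package those four inputs into a single message is sketched below. The section headings and wording of the template are assumptions for illustration and are not the template GEPA actually uses.

```python
from typing import Sequence

def build_reflection_prompt(current_prompt: str, trajectory: str, score: float,
                            feedback: str, lessons: Sequence[str]) -> str:
    """Assemble the four inputs listed above into one message (hypothetical layout)."""
    lesson_lines = "\n".join(f"- {lesson}" for lesson in lessons) or "- (none yet)"
    return (
        "You are improving the instructions of one module in a larger system.\n\n"
        f"## Current prompt\n{current_prompt}\n\n"
        f"## Trajectory\n{trajectory}\n\n"
        f"## Score\n{score}\n\n"
        f"## Feedback\n{feedback}\n\n"
        f"## Lessons from earlier candidates\n{lesson_lines}\n\n"
        "Diagnose the failure, then write an improved prompt."
    )

message = build_reflection_prompt(
    current_prompt="Given the fields question, summary_1, produce the fields query.",
    trajectory="input: question + summary_1 -> query: (repeats the question)",
    score=0.0,
    feedback="Expected answer: 'Madeira archipelago', Got: 'Funchal municipality'",
    lessons=[],
)
print(message)
```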

Step 4: Mutate

Based on the reflection, the LLM proposes an updated prompt. This is not a random perturbation — it is a targeted edit informed by concrete failure analysis. The new prompt inherits all the lessons from its parent and adds the new insight.

Step 5: Evaluate

The updated system is re-run on the same minibatch. If the score improves, the new candidate moves on to full evaluation and joins the pool. If not, it is discarded, so no further budget is spent on a change that did not help.
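The acceptance test in Step 5 can be sketched as a comparison of mean minibatch scores. The `run` callable and the toy scoring rule are hypothetical stand-ins for a real rollout. In practice the parent's scores are already available from Step 1, so only the child needs fresh rollouts.

```python
from statistics import mean
from typing import Callable, Sequence

def accept_mutation(run: Callable[[str, str], float],
                    parent_prompt: str,
                    child_prompt: str,
                    minibatch: Sequence[str]) -> bool:
    """Keep the child only if it strictly improves the mean minibatch score."""
    parent_score = mean(run(parent_prompt, ex) for ex in minibatch)
    child_score = mean(run(child_prompt, ex) for ex in minibatch)
    return child_score > parent_score

# Toy rollout: easy examples always pass, hard ones need the "entity" hint.
def toy_run(prompt: str, example: str) -> float:
    return 1.0 if example.startswith("easy") or "entity" in prompt else 0.0

batch = ["easy q1", "hard q2", "hard q3"]
accepted = accept_mutation(toy_run, "produce the query",
                           "target the missing entity", batch)
print(accepted)
```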

Two sources of diagnostic signal: GEPA identifies two distinct types of traces. Execution traces are what the LLM produces (reasoning, tool calls). Evaluation traces are what the environment produces when computing the reward (compiler errors, failed test cases, mismatched rubrics). Both are natural language. Both are fed to the reflection LLM. Many evaluation metrics produce rich text before collapsing to a scalar — GEPA captures that text before the collapse.

Reflective Mutation in Action

Interactive demo: run the system to see an execution trace, reflect to see the LLM's diagnosis, then mutate to see the resulting prompt change.

Why reflection beats random mutation: Without reflection, you could still do evolutionary prompt optimization — randomly perturb prompts and keep the best. But random mutation has no idea what to change. It might flip irrelevant words while ignoring the actual problem. Reflective mutation is targeted: it reads the failure, understands the cause, and surgically fixes the prompt. This is why GEPA can match GRPO's performance with as few as 6 training rollouts on some tasks (102, 32, 6, and 179 training rollouts across four tasks).

What makes reflective mutation fundamentally different from random prompt perturbation?

The reflection LLM reads the full execution trace and evaluation feedback, diagnoses the specific failure cause, and proposes a targeted prompt edit, rather than blindly perturbing words
It uses a larger language model for the mutation step
It applies mutations to multiple modules simultaneously rather than one at a time

Chapter 4: The Pareto Frontier

Reflective mutation is powerful, but it has a failure mode: local optima. If you always evolve the best-performing candidate, you get trapped. You find one good strategy, keep refining it, and exhaust your budget without discovering a fundamentally different (and better) approach.

Figure 6 in the paper shows this vividly. With a greedy "always pick the best" strategy (like TextGrad uses), the optimization tree degenerates into a long chain — one child after another, each a minor tweak of the same idea. After the first improvement, the search stalls and burns rollouts trying to squeeze out more from a single strategy.

Pareto-Based Candidate Selection

GEPA's solution is elegant: instead of tracking one "best" candidate, it maintains a Pareto frontier across all training instances.

Here is how it works. For each training example i, GEPA records the score of every candidate. A candidate is on the Pareto frontier if it achieves the best score on at least one training example. Candidates that are strictly dominated (beaten by another candidate on every example) are pruned.

Concrete example: Suppose you have 3 training examples and 4 candidates. Candidate A scores [0.8, 0.3, 0.5]. Candidate B scores [0.5, 0.9, 0.4]. Candidate C scores [0.4, 0.4, 0.95]. Candidate D scores [0.3, 0.3, 0.3]. Candidates A, B, and C are all on the Pareto frontier — each is the best at something. D is dominated (beaten on all examples) and gets pruned. GEPA will sample from {A, B, C}, weighted by how many examples each leads on.
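The example above can be checked with a few lines of code. This sketch implements only the "best on at least one example" rule as described in the text. It does not include a separate pruning pass for candidates that tie for a lead while being dominated elsewhere, and the function names are made up for illustration.

```python
from typing import Dict, List, Set

scores: Dict[str, List[float]] = {
    "A": [0.8, 0.3, 0.5],
    "B": [0.5, 0.9, 0.4],
    "C": [0.4, 0.4, 0.95],
    "D": [0.3, 0.3, 0.3],
}

def leaders_per_example(scores: Dict[str, List[float]]) -> List[Set[str]]:
    """For each training example, the set of candidates with the best score."""
    n_examples = len(next(iter(scores.values())))
    leaders = []
    for i in range(n_examples):
        best = max(s[i] for s in scores.values())
        leaders.append({name for name, s in scores.items() if s[i] == best})
    return leaders

def pareto_frontier(scores: Dict[str, List[float]]) -> Set[str]:
    """Candidates that lead on at least one training example."""
    return set().union(*leaders_per_example(scores))

frontier = pareto_frontier(scores)
print(sorted(frontier))  # ['A', 'B', 'C']
```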

Why Pareto Prevents Local Optima

A candidate that solves a specific type of problem well might encode a strategy that the globally best candidate lacks. By keeping it in the pool and sometimes evolving from it, GEPA explores diverse regions of prompt space. The search history becomes a branching tree, not a single chain.

Ablation proof: Table 3 in the paper compares three selection strategies on Qwen3 8B across four tasks. SelectBestCandidate (greedy, used by TextGrad): +6.05% aggregate improvement. BeamSearch (top-N pool, used by APO): +5.11%. GEPA's Pareto-based selection: +12.44%. That is more than double the greedy strategy. On PUPA specifically, Pareto gives +11.03% where greedy gives only +4.63%.

Sampling from the Frontier

When GEPA needs to select a candidate to mutate, it:

Computes the Pareto frontier (non-dominated candidates)
Counts how many training examples each frontier candidate leads on (call this f[Φ])
Samples a candidate with probability proportional to f[Φ]

Candidates that lead on many examples get sampled more often (exploitation), but candidates that lead on even just one example still get a chance (exploration). This naturally balances the exploration–exploitation tradeoff.
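The three steps above can be sketched with `random.choices`, which samples in proportion to the weights it is given. The score table is a made-up example in which A leads on three examples and B on one. The function names are assumptions.

```python
import random
from collections import Counter
from typing import Dict, List

def lead_counts(scores: Dict[str, List[float]]) -> Counter:
    """f[candidate] = number of training examples where it attains the best score."""
    counts: Counter = Counter()
    n_examples = len(next(iter(scores.values())))
    for i in range(n_examples):
        best = max(s[i] for s in scores.values())
        for name, s in scores.items():
            if s[i] == best:
                counts[name] += 1
    return counts

def sample_candidate(scores: Dict[str, List[float]], rng: random.Random) -> str:
    """Sample a frontier candidate with probability proportional to its lead count."""
    counts = lead_counts(scores)
    names = sorted(counts)               # only candidates that lead somewhere
    weights = [counts[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

scores = {
    "A": [0.8, 0.3, 0.5, 0.7],
    "B": [0.5, 0.9, 0.4, 0.6],
    "D": [0.3, 0.3, 0.3, 0.3],
}
rng = random.Random(0)
draws = [sample_candidate(scores, rng) for _ in range(10_000)]
print(lead_counts(scores), draws.count("A") / len(draws))
```

With these weights A should be drawn roughly three times as often as B, and D should never be drawn.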

Why does GEPA use Pareto-based selection instead of always evolving the best candidate?

Always picking the best leads to local optima: you refine one strategy endlessly while missing fundamentally different approaches. Pareto selection keeps diverse candidates alive, doubling the aggregate improvement (+12.44% vs +6.05%)
Because Pareto selection is computationally cheaper than ranking all candidates
Because it allows GEPA to use fewer total rollouts per iteration

Chapter 5: The Full Algorithm

Now let's assemble the pieces. GEPA's optimization loop combines reflective mutation with Pareto-based candidate selection in a genetic framework.

Initialization

GEPA starts with a single candidate: the base system Φ with its seed prompts. The training data D_train is split into D_feedback (used for rollouts and reflection) and D_pareto (used for scoring candidates on the Pareto frontier). The candidate pool P contains only the base system initially.

The Loop (while budget B > 0)

1. Select Candidate

Use Pareto-based selection (Algorithm 2) to choose a candidate Φ_k from the frontier. Candidates leading on more examples are sampled more often.

↓

2. Select Module

Choose which module j to update (round-robin across the |M| modules).

↓

3. Execute + Gather Feedback

Run Φ_k on a minibatch M from D_feedback. Trace the execution. Call μ_f for scores and text feedback.

↓

4. Reflective Mutation

Show the reflection LLM the (prompt, trace, score, feedback). It proposes an updated prompt π_j'.

↓

5. Minibatch Check

Re-run the updated system on the same minibatch. If the score improved, proceed. If not, discard and loop back.

↓

6. Full Evaluation

Evaluate the new candidate on all of D_pareto. Add it to the pool P with ancestry records.
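The six steps can be tied together in a toy loop. Everything below is a simplified sketch under loud assumptions. Prompts are lists of rule strings, a rollout succeeds when the needed rule is present, and `reflect` is a stub that copies the missing rule out of the feedback text instead of calling an LLM. It mirrors the shape of the loop and is not the paper's implementation.

```python
import random
from itertools import cycle

# Toy data: each example is (module_name, rule the module's prompt must contain).
D_FEEDBACK = [("hop2", "target missing entities"), ("answer", "cite both hops"),
              ("hop2", "avoid paraphrase"), ("answer", "cite both hops")]
D_PARETO = [("hop2", "target missing entities"), ("hop2", "avoid paraphrase"),
            ("answer", "cite both hops")]

def run(candidate, example):
    """One rollout: returns (score, text feedback)."""
    module, rule = example
    if rule in candidate[module]:
        return 1.0, "ok"
    return 0.0, f"module {module} failed; missing guidance: {rule}"

def reflect(prompt, feedback_texts):
    """Stub for the reflection LLM: appends the first missing rule it reads about."""
    for text in feedback_texts:
        if "missing guidance: " in text:
            rule = text.split("missing guidance: ")[1]
            if rule not in prompt:
                return prompt + [rule]
    return prompt

def select_candidate(pareto_scores, rng):
    """Sample an index in proportion to how many D_PARETO examples it leads on."""
    weights = [
        sum(row[i] == max(s[i] for s in pareto_scores) for i in range(len(D_PARETO)))
        for row in pareto_scores
    ]
    return rng.choices(range(len(pareto_scores)), weights=weights, k=1)[0]

def gepa_sketch(seed, budget, minibatch_size=2, rng_seed=0):
    rng = random.Random(rng_seed)
    pool = [seed]
    pareto_scores = [[run(seed, ex)[0] for ex in D_PARETO]]
    budget -= len(D_PARETO)
    modules = cycle(sorted(seed))
    while budget >= 2 * minibatch_size + len(D_PARETO):
        parent = pool[select_candidate(pareto_scores, rng)]      # 1. select candidate
        module = next(modules)                                   # 2. select module
        batch = rng.sample(D_FEEDBACK, minibatch_size)           # 3. execute + feedback
        results = [run(parent, ex) for ex in batch]
        relevant = [text for (m, _), (_, text) in zip(batch, results) if m == module]
        child = dict(parent)
        child[module] = reflect(parent[module], relevant)        # 4. reflective mutation
        child_total = sum(run(child, ex)[0] for ex in batch)     # 5. minibatch check
        budget -= 2 * minibatch_size
        if child_total <= sum(score for score, _ in results):
            continue                                             # discard, loop back
        pool.append(child)                                       # 6. full evaluation
        pareto_scores.append([run(child, ex)[0] for ex in D_PARETO])
        budget -= len(D_PARETO)
    best = max(range(len(pool)), key=lambda k: sum(pareto_scores[k]))
    return pool[best], sum(pareto_scores[best]) / len(D_PARETO)

best_prompts, best_score = gepa_sketch({"hop2": [], "answer": []}, budget=400)
print(best_prompts, best_score)
```

Note how the budget is charged: every iteration pays for the minibatch twice (parent and child), while the full D_pareto evaluation is paid only for candidates that pass the minibatch check.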

System-Aware Merge (Crossover)

In addition to mutation, GEPA supports a crossover operation called System-Aware Merge. When two candidates have evolved different modules independently (one improved the query module, another improved the synthesis module), Merge combines them by taking the best version of each module from each lineage. This is unique to compound AI systems — you can't do this with single-prompt optimization.
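A rough sketch of the idea: for each module, keep the version from whichever lineage changed it relative to a common ancestor. This changed-versus-ancestor rule is a simplifying assumption that stands in for "best version of each module". The function name and the tie-breaking choice are hypothetical.

```python
from typing import Dict

def system_aware_merge(ancestor: Dict[str, str],
                       a: Dict[str, str],
                       b: Dict[str, str]) -> Dict[str, str]:
    """Per module, take the prompt from the lineage that evolved it."""
    merged = {}
    for name, base_prompt in ancestor.items():
        if b[name] != base_prompt and a[name] == base_prompt:
            merged[name] = b[name]      # only lineage b improved this module
        else:
            # Only a changed it, neither changed it, or both did.
            # This sketch arbitrarily keeps a's version in the last case.
            merged[name] = a[name]
    return merged

ancestor = {"query": "make a query", "synthesis": "answer the question"}
lineage_a = {"query": "target entities missing from summary_1",
             "synthesis": "answer the question"}
lineage_b = {"query": "make a query",
             "synthesis": "answer using evidence from both hops"}
merged = system_aware_merge(ancestor, lineage_a, lineage_b)
print(merged)
```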

The complete data flow: System Φ(x) → trajectory τ (reasoning + tool calls + outputs) → reflection (natural language diagnosis) → mutation (prompt edit) → evaluation (metric μ) → Pareto update. Each iteration uses a small number of rollouts (minibatch + validation), and the accumulated lessons flow through the genetic tree via ancestry records.

Termination

When the budget is exhausted, GEPA returns the candidate with the best aggregate score on D_pareto.

Budget allocation: The majority of GEPA's rollout budget goes to validation (scoring candidates on D_pareto for Pareto selection), not to generating learning signals. The actual training rollouts (used for reflection) can be as few as 6–179 across four tasks. If we count only training rollouts, GEPA matches GRPO's best validation with 102, 32, 6, and 179 rollouts respectively — up to 78x more efficient.

What happens when a reflective mutation does NOT improve the minibatch score?

The new candidate is discarded immediately, so no budget is spent evaluating it on the full D_pareto set. Only improvements survive to the pool
It is added to the pool anyway with a penalty score
The reflection LLM is called again with a higher temperature

Chapter 6: Results vs GRPO

The headline result: across six benchmarks on Qwen3 8B, GEPA outperforms GRPO (24K rollouts) by +6% average and up to +20%, while using up to 35x fewer rollouts.

Task-by-Task on Qwen3 8B

Task	Baseline	GRPO (24K)	GEPA	GEPA Rollouts
HotpotQA	42.3%	43.3%	62.3%	6,871
IFBench	36.9%	35.9%	38.6%	3,593
HoVer	35.3%	38.7%	52.3%	7,051
PUPA	80.8%	86.7%	91.9%	2,426
AIME-2025	27.3%	38.0%	32.0%	1,839
LiveBench-Math	48.7%	51.3%	52.0%	1,839

The HotpotQA result is striking: GRPO barely moves the needle (+1% over baseline with 24K rollouts). GEPA jumps +20% with 6,871 rollouts. The multi-hop QA system has multiple modules and complex interactions — exactly the setting where reflective diagnosis of traces shines. GRPO's scalar reward cannot tell the system which hop went wrong.

AIME-2025 is the one exception where GRPO outperforms GEPA. This is a math reasoning task where the primary bottleneck is the LLM's mathematical ability, not its instruction-following. Weight updates (which GRPO does) can teach new mathematical reasoning patterns; prompt changes alone cannot. This is an honest limitation of prompt optimization.
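The per-task gains over baseline can be recomputed directly from the table. The numbers below are copied from it; the variable names are just for this sketch.

```python
# (baseline, GRPO at 24K rollouts, GEPA) in percent, copied from the table above.
results = {
    "HotpotQA": (42.3, 43.3, 62.3),
    "IFBench": (36.9, 35.9, 38.6),
    "HoVer": (35.3, 38.7, 52.3),
    "PUPA": (80.8, 86.7, 91.9),
    "AIME-2025": (27.3, 38.0, 32.0),
    "LiveBench-Math": (48.7, 51.3, 52.0),
}

# Gain over baseline in percentage points, as (GRPO gain, GEPA gain).
gains = {
    task: (round(grpo - base, 1), round(gepa - base, 1))
    for task, (base, grpo, gepa) in results.items()
}
for task, (grpo_gain, gepa_gain) in gains.items():
    print(f"{task:15s} GRPO {grpo_gain:+5.1f}   GEPA {gepa_gain:+5.1f}")

# Tasks where the weight-update method gains more than prompt optimization.
grpo_wins = [task for task, (g, p) in gains.items() if g > p]
print(grpo_wins)  # ['AIME-2025']
```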

Sample Efficiency Deep Dive

To match GRPO's best validation scores, GEPA needs only:

243 rollouts on HotpotQA (vs 24K for GRPO — 99x efficiency)
402 rollouts on IFBench
330 rollouts on HoVer
1,143 rollouts on PUPA

If we count only the training rollouts (not validation), the numbers are even more dramatic: 102, 32, 6, and 179 rollouts.

Learning Curves: GEPA vs GRPO vs MIPROv2

Interactive chart: per-task learning curves showing performance against rollout budget. GEPA (orange) reaches high performance early; GRPO (blue) climbs slowly over 24K rollouts.

On which task does GRPO outperform GEPA, and why?

AIME-2025, because math reasoning requires new computational abilities that weight updates can teach but prompt changes alone cannot
HotpotQA, because multi-hop reasoning benefits from RL exploration
IFBench, because instruction following requires fine-grained weight adjustments

Chapter 7: Results vs MIPROv2

MIPROv2 is the prior state-of-the-art prompt optimizer (from the DSPy ecosystem). It optimizes both instructions and few-shot demonstrations jointly. GEPA outperforms it on every benchmark and every model.

Qwen3 8B Comparison

Method	Aggregate Score	Improvement over Baseline
Baseline	45.2%	—
MIPROv2	47.8%	+2.6%
GRPO (24K)	48.9%	+3.7%
GEPA	54.8%	+9.6%

GPT-4.1 Mini Comparison

Method	Aggregate Score	Improvement
Baseline	53.0%	—
Trace (OptoPrime)	56.3%	+3.3%
MIPROv2	58.7%	+5.6%
TextGrad	59.1%	+6.1%
GEPA	65.2%	+12.2%
GEPA+Merge	66.4%	+13.3%

AIME-2025 spotlight: On AIME-2025 with GPT-4.1 Mini, GEPA achieves 59.3% vs MIPROv2's 51.3% — a +8 point gap. On the same benchmark with Qwen3 8B, GEPA gets 32.0% vs MIPROv2's 20.0% — a +12 point gap. Prompt optimization alone pushes a small open-source model to solve nearly a third of competition-level math problems.

Why GEPA Beats MIPROv2

Two key differences explain the gap:

Instructions vs demonstrations: MIPROv2 relies heavily on few-shot demonstrations, which are long and specific. GEPA's reflectively evolved instructions are up to 9.2x shorter than MIPROv2's prompts while performing better. Shorter prompts mean lower cost, lower latency, and better generalization.
Reflection vs proposal: MIPROv2 proposes instructions based on task descriptions. GEPA proposes instructions based on observed failures — it has seen what goes wrong and fixes specifically that.

Cross-model generalization: Here is a remarkable finding. GEPA-optimized prompts for Qwen3-8B, when evaluated on GPT-4.1 Mini without modification, achieve +9% aggregate improvement. This outperforms MIPROv2 (+5.6%), TextGrad (+6.1%), and Trace (+3.3%) — all of which were optimized directly on GPT-4.1 Mini. The prompts learned from a weaker model transfer to a stronger one because they encode strategies, not model-specific tricks.

Why are GEPA's prompts more efficient than MIPROv2's despite achieving higher performance?

GEPA produces concise declarative instructions (up to 9.2x shorter) based on observed failure analysis, while MIPROv2 relies on long few-shot demonstrations; shorter prompts generalize better and cost less
GEPA uses a more powerful language model for prompt generation
GEPA compresses MIPROv2's prompts using summarization

Chapter 8: What Optimized Prompts Look Like

One of the most compelling aspects of GEPA is that its outputs are human-readable. Unlike RL-optimized weights (which are inscrutable floating-point tensors), GEPA's optimized prompts are natural language instructions that you can read, understand, and even learn from.

Example: Multi-Hop QA (HotpotQA)

The seed prompt for the second-hop query module is minimal:

Seed prompt: "Given the fields question, summary_1, produce the fields query."

After GEPA's optimization, this becomes a rich, multi-section instruction:

GEPA's optimized prompt (excerpt): "You will be given two input fields: question and summary_1. Your task: Generate a new search query optimized for the second hop of a multi-hop retrieval system. The original user question is typically complex and requires information from multiple documents. Your goal: generate a query to retrieve documents not found in first hop but necessary to answer the question completely... Identify entities or topics mentioned in summary_1 that are related but different from first-hop documents. Reframe the query to explicitly mention these broader or related entities... Ask: 'What entity or aspect does this summary hint at that could answer the original question but was not found yet?'"

What GEPA Learned

Look at what the optimization discovered: the seed prompt gives zero guidance. GEPA's prompt contains specific strategies derived from observing failures:

"Target broader regions, not parishes" — learned from a failure where the query searched for a parish instead of the archipelago
"Don't paraphrase the original question" — learned from failures where the second-hop query was redundant with the first
"Look for entities mentioned in summary_1 that aren't in the original question" — a general strategy discovered through multiple reflections

Evolution Trajectory: PUPA (Privacy Task)

Figure 5 in the paper shows the optimization tree for the PUPA privacy delegation task. Each node is a candidate with its score. The progression from seed (82.3%) to best (97.6%) accumulates specific lessons:

Candidate 0 (82.3%): Base instruction — "create a privacy-preserving request"
Candidate 3 (87.8%): Adds detailed PII identification and generalization strategies
Candidate 4 (94.4%): Formalizes output into Reasoning + Request sections. Bans names/codes explicitly.
Candidate 5 (94.7%): Adds transparent transformation rationale. Handles edge cases (fictional characters, professional scenarios).
Candidate 11 (97.6%): Rigorous stepwise protocol. Zero-leakage tolerance. Auditable privacy guarantees.

Before vs After: Prompt Comparison

Interactive comparison: the seed prompt and GEPA's optimized version for the multi-hop QA second-hop query module. Notice how GEPA adds concrete strategies learned from failure analysis.

Prompts as knowledge artifacts: GEPA's optimized prompts are not just instructions — they are distilled knowledge. Each sentence encodes a lesson learned from specific failures. A human reading the optimized prompt would learn the same strategies that took GEPA hundreds of rollouts to discover. This is a unique advantage over weight-space optimization: the learned knowledge is interpretable.

What kind of knowledge does GEPA embed in its optimized prompts?

Specific task strategies learned from observed failures: concrete instructions like "target broader regions" and "don't paraphrase the original question" that address real failure modes
Statistical patterns extracted from the training data distribution
Compressed representations of the few-shot examples from the training set

Chapter 9: Connections

GEPA sits at the intersection of several active research threads. Let's map where it fits.

Relation to DSPy / MIPROv2

GEPA builds on the DSPy framework's formalism of compound AI systems (Φ = (M, C, X, Y)). MIPROv2 is DSPy's prompt optimizer, focusing on joint instruction + few-shot optimization. GEPA outperforms MIPROv2 by replacing proposal-based optimization with reflection-based optimization and by using Pareto selection instead of Bayesian optimization.

Relation to TextGrad / Trace

TextGrad and Trace (with OptoPrime) are text-gradient methods that also use language feedback for optimization. But they use a greedy "select best candidate" strategy, which in GEPA's selection ablation yields a +6.05% aggregate improvement versus +12.44% for Pareto-based selection. GEPA also captures richer feedback via execution traces, not just output-level gradients.

Relation to GRPO / RLVR

GRPO and GEPA solve the same optimization problem (the objective defined in Chapter 2) but in different parameter spaces. GRPO updates weights Θ via policy gradients from scalar rewards. GEPA updates prompts Π via natural language reflection. The key tradeoff: GRPO can teach genuinely new capabilities (as seen on AIME), but at 24K+ rollouts. GEPA uses up to 35x fewer rollouts on tasks where the LLM already has the capability but needs better instructions.

Relation to OPRO / Meta-Harness

OPRO (Yang et al., 2023) was an early prompt optimizer using LLM-based optimization. GEPA advances this by adding structured execution trace reflection and Pareto-based candidate selection. Meta-Harness is related work on optimizing prompts for AI systems.

Relation to AlphaEvolve

AlphaEvolve (Google DeepMind) also uses LLM-based evolutionary search for code optimization. GEPA shares the evolutionary framework but focuses on prompt optimization for compound AI systems, with the unique addition of reflective mutation from execution traces.

GEPA as Inference-Time Search

Beyond task adaptation, GEPA can be repurposed for inference-time code optimization. On NPUEval (AMD kernel generation), GEPA boosts GPT-4o from 4.25% to 30.52% vector utilization. On KernelBench (CUDA), it pushes near-0% fast1 score above 20%. The key: domain documentation is injected into the feedback function μ_f, surfacing relevant manual sections based on compiler errors.
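The documentation-injection idea can be sketched as a feedback function that looks up manual sections by matching substrings of the compiler output. The lookup table, its keys, and the bracketed placeholder section texts are all invented for illustration. A real setup would index the actual vendor documentation, and the matching strategy here is an assumption.

```python
from typing import Dict, Tuple

# Placeholder manual: keys are substrings of compiler messages, values stand in
# for real documentation sections. Every entry is invented for illustration.
MANUAL: Dict[str, str] = {
    "undeclared identifier": "<manual section on required headers>",
    "misaligned": "<manual section on buffer alignment>",
}

def kernel_feedback(compiler_output: str, score: float) -> Tuple[float, str]:
    """Hypothetical mu_f: the score plus compiler text and matching manual sections."""
    lowered = compiler_output.lower()
    sections = [doc for key, doc in MANUAL.items() if key in lowered]
    parts = [f"Compiler output:\n{compiler_output}"]
    if sections:
        parts.append("Relevant documentation:\n" + "\n".join(sections))
    return score, "\n\n".join(parts)

score, text = kernel_feedback("error: use of undeclared identifier 'vec_add'", 0.0)
print(text)
```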

Cheat Sheet

Aspect	GEPA
What it optimizes	Prompts Π_Φ (weights frozen)
Key mechanism	Reflective mutation from execution traces
Selection strategy	Pareto frontier (multi-objective)
Crossover	System-aware merge (combine best modules)
Works with	Any LLM (open or proprietary)
vs GRPO	+6% avg, up to +20%, 35x fewer rollouts
vs MIPROv2	+10% aggregate, prompts 9.2x shorter
Best result	+12 points over MIPROv2 on AIME-2025 (Qwen3 8B)
Cross-model transfer	Qwen3 prompts → +9% on GPT-4.1 Mini
Min training rollouts	6 rollouts to match GRPO on one task
Models tested	Qwen3-8B, GPT-4.1 Mini, GPT-4o
Tasks	HotpotQA, AIME, LiveBench, IFBench, PUPA, HoVer

The broader lesson: When your system produces rich natural-language traces, don't throw them away by collapsing to scalar rewards. Read them. Diagnose them. Learn from them. GEPA shows that this simple principle — reflection in language — can outperform sophisticated RL with a fraction of the data.

When should you use GRPO over GEPA?

When the task requires lots of few-shot demonstrations
When you need to optimize both prompts and model weights simultaneously
When the bottleneck is the LLM's core capability (e.g., mathematical reasoning) rather than its instruction-following; weight updates can teach new abilities that prompt changes alone cannot

GEPA: Reflective Prompt Evolution