Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, Gabriel Synnaeve — Meta AI (FAIR), 2025

RLEF: Grounding Code LLMs in Execution Feedback

SOTA code LLMs can't improve code iteratively — refinement is worse than independent sampling. RLEF fixes this with end-to-end RL that teaches models to read compiler errors and fix code accordingly, achieving new SOTA on competitive programming with 10x fewer samples.

Prerequisites: What an LLM is + Basic RL intuition (reward, policy) + PPO at a high level
10
Chapters
4
Simulations

Chapter 0: The Problem

You give a code LLM a competitive programming problem. It writes a solution. The solution fails some test cases. You paste the error message back into the conversation. The model tries again.

Intuitively, this should work. The model saw the error. It knows what went wrong. It should be able to fix it. But here's the uncomfortable truth: state-of-the-art code LLMs are worse at iterative refinement than at simply generating independent solutions from scratch.

The refinement paradox: If you have a budget of 3 LLM calls, you're better off asking the model for 3 independent solutions than asking it to try once, read the error, and try twice more. The model that "sees" the error performs worse than the model that's blind to it. This holds for GPT-4, Llama 3.1, and essentially every instruction-tuned LLM tested.

Why? Because instruction-tuned models weren't trained to use execution feedback. They were trained to follow instructions and produce plausible text. When they see an error message, they often make minimal changes — sometimes literally copying the same code again — or make random edits unrelated to the actual error. They don't know how to parse "RuntimeError: list index out of range" and trace it back to the specific off-by-one bug in their loop.

Olausson et al. (2024) showed that large models are needed just to provide useful error analysis, and that multiple rounds of repair don't help. Kapoor et al. (2024) demonstrated that independent sampling beats self-repair when you account for the compute budget. The entire iterative refinement story was, practically speaking, broken.

Refinement vs Independent Sampling

Click "Run Experiment" to see how a base LLM performs when sampling independently vs. refining iteratively on the same problem. The refinement curve flatlines — the model doesn't improve from seeing errors.

Why are current code LLMs bad at iterative refinement despite seeing error messages?

Chapter 1: The Key Insight

If models can't use execution feedback because they weren't trained to, then train them to. Not with supervised examples of "here's an error, here's the fix." With reinforcement learning, where the only reward signal is: did your final code pass the tests?

This is the central idea of RLEF (Reinforcement Learning from Execution Feedback). Frame the entire multi-turn code generation process — write code, execute, read error, rewrite, execute again — as a single RL episode. The model's policy is its code generation behavior across all turns. The reward comes at the end: binary pass/fail on held-out test cases.

Why RL and not supervised fine-tuning? SFT requires ground-truth examples of good repair trajectories. Where do you get them? You'd need to mine successful multi-turn conversations from the model itself — but the model is bad at repair, so good examples are rare. RL sidesteps this by learning from the reward signal directly. The model discovers repair strategies through trial and error, guided by whether its final code actually works.

The paper shows this concretely: SFT on mined repair trajectories improves validation performance slightly but doesn't generalize to the test set. Few-shot prompting with repair examples actually hurts instruction-tuned models. Only RLEF produces robust, generalizable improvement.

The Two Skills RLEF Teaches

Skill 1: Task Alignment
Better first-attempt code generation. The model learns the specific domain (competitive programming) during RL, producing fewer errors on the initial try.
+
Skill 2: Feedback Grounding
Reading execution feedback and making targeted fixes. The model learns to parse error messages, identify root causes, and rewrite the faulty code — not just resample randomly.
Both skills matter. RLEF-trained models make fewer errors on turn 1 (task alignment) AND fix more errors on turns 2-3 (feedback grounding). The combination is what produces the 10x sample reduction over prior methods.
Why does RLEF use RL instead of supervised fine-tuning for teaching iterative repair?

Chapter 2: The Multi-Turn Loop

Let's walk through exactly what happens during one RLEF episode. The model gets a competitive programming problem and interacts with an execution environment over multiple turns.

Step by Step

1. Problem Prompt
The model receives a natural language problem description with example test cases (public tests). "Let's call a string beautiful if it does not contain a palindromic substring of length ≥ 2..."
2. Code Generation (Turn 1)
The model generates a Python solution. This is its first attempt — cold start, no prior feedback.
3. Public Test Execution
The generated code runs against public test cases. Three outcomes: all pass, some fail (with error details), or the code is syntactically invalid.
4a. If FAIL: Feedback → Retry
The execution result (error type, failed inputs/outputs, exception traceback) is formatted as text and appended to the conversation. The model generates a new solution with the full history visible.
4b. If PASS or Turn Limit
The current solution is submitted. It's evaluated against private test cases (the ones the model never sees). Binary reward: +1 if all private tests pass, -1 otherwise.
Two test sets, different roles. Public tests provide feedback (the model sees them). Private tests provide the reward (the model never sees them). This separation prevents the model from learning shortcuts like hard-coding expected outputs. It also means passing public tests doesn't guarantee correctness — the model must generalize.

A Concrete Example

From the paper's Figure 2:

Turn 1: Naive Solution
def min_cost_to_beautiful(s): while True: found = False; for length in range(len(s), 1, -1): ...
Result: "Execution took too long" — the brute-force approach times out on large inputs.
↓ feedback: TLE on input '5A baacb 13 15 23'
Turn 2: Optimized Solution
from functools import lru_cache; def is_beautiful(s): ...
The model reads "too long" and responds with memoization. Result: all public tests pass → submit.

The key observation: this error-to-fix pattern only emerges after RLEF training. Before RLEF, the model would typically output nearly identical code on turn 2, ignoring the timeout error.

Why does RLEF use separate public and private test sets?

Chapter 3: RLEF Training (Showcase)

Now let's look at the actual RL formulation. This is where RLEF gets interesting — the multi-turn code generation process is cast as a Markov Decision Process, and the model is optimized with PPO.

The MDP Formulation

The iterative code generation loop maps directly onto an MDP:

Partial observability. This is technically a POMDP — the reward depends on private tests the model can't see. The model observes only public test feedback but must produce code that generalizes to the full private test suite.

The Reward Function

At step t, with context ct = (o0, a0, o1, a1, ..., ot):

R(st, at) = r(st, at) − β log π(at|ct) / ρ(at|ct)

Where the task reward r is:

r(st, at) = +1 if episode ends and all tests pass
r(st, at) = −1 if episode ends and any test fails
r(st, at) = −0.2 if at does not contain valid code

The second term is a KL penalty that keeps the trained policy π from drifting too far from the initial policy ρ (the pre-trained Llama model). This serves as both entropy regularization and a guard against catastrophic forgetting. The constant β trades off task reward against staying close to the original model.

The −0.2 penalty for invalid code addresses a failure mode the authors discovered: without it, the model sometimes generates garbage text in non-final turns (since only the final turn's code matters for the reward). The small penalty encourages valid code at every step, which preserves the conversational structure needed for effective multi-turn interaction.

Token-Level Policy, Turn-Level Value

A subtle design choice: the policy operates at the token level (the LLM generates tokens one by one), but the value function operates at the turn level (it predicts the expected return from the last token of each prompt, before the model responds).

This hybrid approach outperformed both fully token-level and fully turn-level alternatives in ablations. The advantage for all tokens within a response is a single value:

At = −V(ct) + Σi=tT R(si, ai)

And the KL penalty uses the geometric mean of token probabilities rather than the product. This prevents a bias toward shorter generations — important because intermediate (non-final) turns need full, detailed code, not abbreviated responses.

RLEF Training Episode

Step through a training episode. Watch the model generate code, receive feedback, regenerate, and receive a reward. The reward propagates back through all turns via PPO.

Click to start episode
Why does RLEF use token-level policy but turn-level value function?

Chapter 4: Execution Feedback

The execution feedback is the bridge between "the code failed" and "here's why." RLEF's power comes from teaching models to parse this bridge. Let's look at what the model actually sees.

Types of Feedback

Error TypeWhat the Model SeesWhat It Means
OUTPUTExpected: "3 1 2", Got: "3 2 1"Code runs but produces wrong answer
EXCEPTIONRuntimeError: list index out of range at line 14Code crashes with a traceback
TIMEOUTExecution took too long (TLE)Algorithm too slow for the input size
OOMMemoryError: allocation exceeded limitAlgorithm uses too much memory

Each error type requires a different kind of fix. Wrong output means a logic bug. Exceptions mean a code bug. Timeouts mean the algorithm is wrong — you need a fundamentally different approach, not just a line fix. OOM is similar: the data structure is wrong.

The feedback template is structured. It includes: which test cases failed, what input was given, what output was expected vs. produced, and any exception traceback. This is formatted as natural language and inserted into the conversation. The model sees the full history: problem description, turn 1 code, turn 1 feedback, and must produce turn 2 code.

What Changes After RLEF

Figure 3 in the paper is striking. Before RLEF, looking at 5,640 rollouts (20 per problem):

The chrF evidence is damning. chrF measures character-level similarity between successive code outputs. Base models have chrF close to 1.0 — meaning they literally copy-paste their previous code with minimal changes. RLEF models have much lower chrF, indicating substantial code rewrites between turns. They're not making cosmetic tweaks; they're genuinely rewriting the solution.

Random Feedback Ablation

To prove the model actually reads the feedback (rather than just using the retry opportunity to sample a different solution), the authors run an ablation: replace real execution feedback with feedback from executing code on an unrelated problem.

Result: with random feedback, error recovery drops dramatically. This confirms that RLEF-trained models are genuinely parsing and using the error information, not just treating the retry as another independent sampling opportunity.

How did the authors prove that RLEF models actually read the execution feedback rather than just sampling randomly?

Chapter 5: Architecture

RLEF is intentionally simple in its architecture. There's no special scaffolding, no chain-of-thought modules, no external test generators. Just a standard LLM backbone, a conversation format, and PPO.

The LLM Backbone

The authors use Llama 3.1 Instruct models at 8B and 70B parameters. These are already instruction-tuned and capable of code generation — RLEF doesn't start from a raw pre-trained model. The instruction tuning provides the baseline conversational ability, and RLEF adds the multi-turn feedback-grounding capability on top.

No code-specific model needed. RLEF works on general instruction-tuned models. The authors also test on the older Llama 3.0 8B and get substantial improvements (4.1 → 12.5 on valid set 1@3), suggesting RLEF can partially substitute for instruction tuning on code tasks.

Training Setup

Parameter8B Model70B Model
PPO updates12,0008,000
Turn limit (training)33
Training dataCodeContests training set (~12,659 problems)
Checkpoint selectionBased on valid set performance
Sampling temperature0.2 (1@3), 1.0 (10@100)
KL penalty βTuned to balance task reward vs. distribution drift

Multi-Turn Context

The conversation format is straightforward. At each turn, the model's context window contains:

Position 1
Original problem description + public test examples
Position 2
Model's turn-1 code solution
Position 3
Execution feedback from turn 1 (formatted error messages)
Position 4
Model's turn-2 code solution (to be generated)

No retrieval augmentation. No chain-of-thought prompting. No external tool use beyond code execution. The simplicity is a feature: the model itself learns when and how to use feedback, rather than relying on scaffolding to structure the interaction.

Reward Design Choices

Three specific choices worth noting:

What is the primary advantage of RLEF's architectural simplicity (no scaffolding, no external tools)?

Chapter 6: Results

RLEF achieves new state-of-the-art results on CodeContests, a challenging competitive programming benchmark, with both the 8B and 70B Llama models.

CodeContests Results (1@3: one rollout, 3 turns)

ModelValidTest
AlphaCodium GPT-3.5 (5@100)2517
AlphaCodium GPT-4 (5@100)4429
MapCoder GPT-4 (1@19)28.5
Llama 3.1 8B Instruct (1@3)8.910.5
Llama 3.1 8B + RLEF (1@3)17.216.0
Llama 3.1 70B Instruct (1@3)25.927.5
Llama 3.1 70B + RLEF (1@3)37.540.1
The 70B RLEF model with 3 samples beats AlphaCodium GPT-4 with 100 samples on the test set (40.1 vs 29). That's a 10x reduction in sample budget to achieve superior performance. The 8B RLEF model (16.0 test, 3 samples) is competitive with AlphaCode 9B (13.3 test, 1000 samples).

Larger Budget: 10@100

With 100 samples (33 rollouts of 3 turns each), the improvements persist:

ModelValid 10@100Test 10@100
AlphaCode 41B + clustering (10@1000)21.016.4
Llama 3.1 70B Instruct50.250.3
Llama 3.1 70B + RLEF54.554.5

Generalization to Other Benchmarks

RLEF improvements on CodeContests transfer to other code benchmarks:

ModelHumanEval+ 1@3MBPP+ 1@3
70B Instruct (multi-turn)75.070.2
70B + RLEF (multi-turn)80.472.2
GPT-4o (multi-turn)80.771.7
The 70B RLEF model matches GPT-4o on HumanEval+ and MBPP+ in the multi-turn setting. And this is with a model fine-tuned only on CodeContests — the feedback-grounding capability generalizes across benchmarks with different problems and different feedback formats.
RLEF Results: Sample Efficiency

Compare solve rates across different sample budgets. RLEF models achieve higher accuracy with far fewer samples.

What is the most surprising result from the RLEF experiments?

Chapter 7: Behavior Analysis

The results show RLEF works. But how does it work? Do the models genuinely use feedback, or have they just gotten better at sampling diverse code? The paper's analysis section answers this definitively.

Evidence 1: Error Recovery Rates

Across 5,640 rollouts (20 per problem, valid + test), the authors count: how many errors occurred at turn 1, and how many were fixed by turns 2 and 3.

RLEF doesn't just fix easy bugs. The model learns to handle TIMEOUT errors — which require fundamentally changing the algorithm, not just patching a line. This is strong evidence that RLEF teaches genuine problem-solving, not template matching.

Evidence 2: Code Diversity (chrF)

chrF measures character n-gram overlap between successive code solutions. A chrF of 1.0 means the code is identical.

Evidence 3: Turn Limit Scaling

If the model truly uses feedback, performance should keep improving with more turns (up to a point). The authors test turn limits from 1 to 10:

5 turns is the sweet spot. RLEF-trained models can effectively leverage up to 5 rounds of feedback, far beyond the 3 turns used during training. This generalization to more turns than seen in training is remarkable — the model has learned a general repair skill, not just a 3-turn pattern.
Do Models Actually Use Feedback?

Compare code similarity (chrF) between successive turns. Base models barely change their code. RLEF models make substantial rewrites. Click to toggle.

Evidence 4: Random Feedback Ablation

The decisive test. Replace real feedback with feedback from executing code against an unrelated problem:

What does the chrF analysis reveal about RLEF-trained models vs base models?

Chapter 8: Comparisons

How does RLEF relate to the competitive landscape of code generation methods? Let's place it precisely.

vs. AlphaCode / AlphaCode 2

AlphaCode (Google DeepMind) generates up to 1,000,000 samples, clusters them, and picks one per cluster for submission. AlphaCode 2 improved this 10,000x in sample efficiency. RLEF's 70B model (37.5 valid, 10@100) beats AlphaCode 2's estimated performance (34.2 valid, 10@100) — and RLEF uses only 100 samples vs. potentially thousands for AlphaCode 2.

vs. AlphaCodium / MapCoder

These are agentic frameworks that chain many LLM calls: chain-of-thought planning, test generation, program repair, etc. They use GPT-4 as the backbone. RLEF's key advantage: all the scaffolding is replaced by end-to-end training. The model learns when to rewrite vs. patch, without explicit scaffolding.

MethodModelBudgetTest Solve Rate
AlphaCodiumGPT-45@10029.0
MapCoderGPT-41@1928.5
RLEFLlama 70B1@340.1
RLEF with 3 samples > AlphaCodium/MapCoder with 19-100 samples. And RLEF uses an open model (Llama 70B) vs. proprietary GPT-4. The trained model internalizes the scaffolding that these frameworks hard-code externally.

vs. Independent Sampling

Before RLEF, independent sampling was the rational strategy. After RLEF, multi-turn refinement strictly dominates independent sampling at every budget level. The reversal is complete: the model now benefits from seeing its errors, which is exactly what you'd expect an effective coder to do.

vs. SFT on Repair Trajectories

The paper includes an ablation comparing RLEF to supervised fine-tuning on mined repair trajectories (good multi-turn rollouts extracted from the 70B model):

MethodValid 1@3Test 1@3
Few-shot (8B)8.58.5
SFT (8B)10.310.0
RLEF (8B)17.216.0
SFT (70B)27.727.2
RLEF (70B)37.540.1

RLEF outperforms SFT by a large margin at both scales. The RL reward signal provides learning signal that imitation cannot.

vs. Dedicated Repair Model

An alternative approach: train one model for generation (single-turn) and a separate model for repair. The paper tests this (Appendix B.3): the two-model system achieves 14.8 valid / 12.6 test (8B), well below the single RLEF model at 17.2 / 16.0. End-to-end training beats the modular approach.

Why does RLEF outperform agentic frameworks like AlphaCodium despite using far fewer samples and no scaffolding?

Chapter 9: Connections

RLEF connects several research threads. Let's map where it fits.

Relation to RLHF

RLHF (Reinforcement Learning from Human Feedback) uses a reward model trained on human preferences. RLEF uses execution results — an automatic reward signal. This is significant: code execution provides a perfect, zero-noise reward. You never need to train a reward model or worry about reward hacking. The code either passes the tests or it doesn't.

Relation to Self-Correction (SCoRe)

Concurrent work by Kumar et al. (2024) proposes SCoRe, a two-stage RL method for self-correction that doesn't use execution feedback — it just asks the model to "reconsider." RLEF's advantage: the execution feedback provides concrete, actionable information about what went wrong. "Reconsider your answer" is weaker than "your code timed out on this input."

Relation to CodeRL / StepCoder

Prior RL-for-code work (Le et al., 2022; Dou et al., 2024) used execution rewards for single-turn generation. RLEF extends this to the multi-turn setting, which is both more natural (humans iterate on code) and more effective (the model can fix specific errors).

Relation to Agent Architectures

RLEF can be viewed as training an agent that interacts with a code execution environment. The key insight: rather than building scaffolding that structures the interaction (like AlphaCodium), RLEF trains the model to structure its own interaction. This "learned scaffolding" is more flexible and more sample-efficient.

Limitations and Open Questions

Cheat Sheet

AspectRLEF
What it teachesReading execution feedback and fixing code iteratively
RL algorithmPPO with KL penalty, binary reward from private tests
Training turns3 (generalizes to 5 at inference)
8B improvement8.9 → 17.2 (valid), 10.5 → 16.0 (test)
70B improvement25.9 → 37.5 (valid), 27.5 → 40.1 (test)
vs. AlphaCodium GPT-440.1 with 3 samples vs. 29 with 100
Sample reduction~10x or more vs. prior SOTA
Key evidenceRandom feedback ablation proves genuine feedback use
GeneralizationHumanEval+, MBPP+, and more turns than training
The broader lesson: When automatic evaluation is available, end-to-end RL beats both scaffolding and supervised fine-tuning. The model should learn to use feedback, not have feedback use scripted for it. RLEF is a concrete demonstration that the "iterative refinement" workflow — write, test, debug, repeat — can be taught rather than hand-coded.
What is the key advantage of RLEF's execution-based reward over RLHF's human-preference reward for code generation?