SOTA code LLMs can't improve code iteratively — refinement is worse than independent sampling. RLEF fixes this with end-to-end RL that teaches models to read compiler errors and fix code accordingly, achieving new SOTA on competitive programming with 10x fewer samples.
You give a code LLM a competitive programming problem. It writes a solution. The solution fails some test cases. You paste the error message back into the conversation. The model tries again.
Intuitively, this should work. The model saw the error. It knows what went wrong. It should be able to fix it. But here's the uncomfortable truth: state-of-the-art code LLMs are worse at iterative refinement than at simply generating independent solutions from scratch.
Why? Because instruction-tuned models weren't trained to use execution feedback. They were trained to follow instructions and produce plausible text. When they see an error message, they often make minimal changes — sometimes literally copying the same code again — or make random edits unrelated to the actual error. They don't know how to parse "RuntimeError: list index out of range" and trace it back to the specific off-by-one bug in their loop.
Olausson et al. (2024) showed that large models are needed just to provide useful error analysis, and that multiple rounds of repair don't help. Kapoor et al. (2024) demonstrated that independent sampling beats self-repair when you account for the compute budget. The entire iterative refinement story was, practically speaking, broken.
If models can't use execution feedback because they weren't trained to, then train them to. Not with supervised examples of "here's an error, here's the fix." With reinforcement learning, where the only reward signal is: did your final code pass the tests?
This is the central idea of RLEF (Reinforcement Learning from Execution Feedback). Frame the entire multi-turn code generation process — write code, execute, read error, rewrite, execute again — as a single RL episode. The model's policy is its code generation behavior across all turns. The reward comes at the end: binary pass/fail on held-out test cases.
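The episode structure described above can be sketched as a small loop. This is an illustrative toy, not the paper's implementation: the names `Episode`, `run_tests`, `rollout`, and `toy_policy` are hypothetical, the "policy" returns a Python callable instead of generated source text, the split into public tests (for feedback) and private tests (for the reward) is an assumption, and the reward values +1 / -1 are placeholders for "pass" and "fail".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Test = Tuple[str, str]          # (input, expected output)
Solver = Callable[[str], str]   # stand-in for a generated program


@dataclass
class Episode:
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (solver name, feedback)
    reward: float = 0.0


def run_tests(solve: Solver, tests: List[Test]) -> str:
    """Return "" if every test passes, otherwise a short feedback message."""
    for given, expected in tests:
        try:
            got = solve(given)
        except Exception as exc:
            return f"EXCEPTION: {type(exc).__name__}: {exc}"
        if got != expected:
            return f"OUTPUT: expected {expected!r}, got {got!r}"
    return ""


def rollout(policy: Callable[[str], Solver], public_tests: List[Test],
            private_tests: List[Test], turn_limit: int = 3) -> Episode:
    """One episode: generate, execute, read feedback, retry; reward only at the end."""
    episode = Episode()
    feedback = ""
    for _ in range(turn_limit):
        solve = policy(feedback)                   # the model sees the previous feedback
        feedback = run_tests(solve, public_tests)  # assumed: feedback comes from public tests
        episode.turns.append((solve.__name__, feedback))
        if not feedback:                           # assumed: stop once public tests pass
            break
    # Placeholder binary reward on the held-out tests for the final solution.
    episode.reward = 1.0 if run_tests(solve, private_tests) == "" else -1.0
    return episode


def toy_policy(feedback: str) -> Solver:
    """Toy stand-in for the LLM: returns a fixed solver once it has seen feedback."""
    def buggy(text: str) -> str:
        return text

    def fixed(text: str) -> str:
        return text[::-1]

    return fixed if feedback else buggy


episode = rollout(toy_policy, [("ab", "ba")], [("abc", "cba")])
print(episode.turns, episode.reward)
```

In this toy run the first turn fails a public test, the second turn passes, and the single end-of-episode reward is the only training signal the loop produces.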
The paper shows this concretely: SFT on mined repair trajectories improves validation performance slightly but doesn't generalize to the test set. Few-shot prompting with repair examples actually hurts instruction-tuned models. Only RLEF produces robust, generalizable improvement.
Let's walk through exactly what happens during one RLEF episode. The model gets a competitive programming problem and interacts with an execution environment over multiple turns.
From the paper's Figure 2:
First attempt (turn 1): `def min_cost_to_beautiful(s): while True: found = False; for length in range(len(s), 1, -1): ...`

Second attempt (turn 2, after the feedback): `from functools import lru_cache; def is_beautiful(s): ...`

The key observation: this error-to-fix pattern only emerges after RLEF training. Before RLEF, the model would typically output nearly identical code on turn 2, ignoring the timeout error.
Now let's look at the actual RL formulation. This is where RLEF gets interesting — the multi-turn code generation process is cast as a Markov Decision Process, and the model is optimized with PPO.
The iterative code generation loop maps directly onto an MDP:
At step $t$, with context $c_t = (o_0, a_0, o_1, a_1, \ldots, o_t)$:
Where the task reward r is:
The second term is a KL penalty that keeps the trained policy π from drifting too far from the initial policy ρ (the Llama 3.1 Instruct model that RLEF starts from). This serves as both entropy regularization and a guard against catastrophic forgetting. The constant β trades off task reward against staying close to the original model.
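Going only by the description in this section, the per-step reward plausibly has the shape below. This is a reconstruction, not a formula quoted from the paper; the symbols $r_{\text{pass}}$ and $r_{\text{fail}}$ are placeholders for the two values of the binary task reward, whose numeric values are not given here.

```latex
R(c_t, a_t) = r(c_t, a_t) \;-\; \beta \,\log \frac{\pi(a_t \mid c_t)}{\rho(a_t \mid c_t)}

r(c_t, a_t) =
\begin{cases}
  r_{\text{pass}} & \text{if } a_t \text{ ends the episode and passes the held-out tests} \\
  r_{\text{fail}} & \text{if } a_t \text{ ends the episode and fails the held-out tests} \\
  0               & \text{otherwise (intermediate turns)}
\end{cases}
```

The first term carries the task signal and is nonzero only at the end of the episode; the second term is the KL penalty scaled by β.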
A subtle design choice: the policy operates at the token level (the LLM generates tokens one by one), but the value function operates at the turn level (it predicts the expected return from the last token of each prompt, before the model responds).
This hybrid approach outperformed both fully token-level and fully turn-level alternatives in ablations. The advantage for all tokens within a response is a single value:
And the KL penalty uses the geometric mean of token probabilities rather than the product. This prevents a bias toward shorter generations — important because intermediate (non-final) turns need full, detailed code, not abbreviated responses.
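A small numeric sketch may help show why the geometric mean removes the length bias, and what a single turn-level advantage looks like. The function names are hypothetical, and the advantage here is simplified to "return minus value at the end of the prompt", which is an assumption rather than the paper's exact estimator.

```python
import math
from typing import List


def log_ratio_product(logp_policy: List[float], logp_ref: List[float]) -> float:
    """log(prod(pi) / prod(rho)): a sum over tokens, so it grows with response length."""
    return sum(p - r for p, r in zip(logp_policy, logp_ref))


def log_ratio_geometric_mean(logp_policy: List[float], logp_ref: List[float]) -> float:
    """Same ratio with geometric-mean probabilities: the per-token average."""
    return log_ratio_product(logp_policy, logp_ref) / len(logp_policy)


def geometric_mean_prob(logps: List[float]) -> float:
    """Geometric mean of token probabilities, computed in log space."""
    return math.exp(sum(logps) / len(logps))


def turn_advantage(turn_return: float, value_at_prompt_end: float,
                   num_tokens: int) -> List[float]:
    """One advantage per turn, shared by every token of the response (simplified)."""
    return [turn_return - value_at_prompt_end] * num_tokens


# Same per-token drift from the reference model, different response lengths.
short_pi, short_ref = [-1.0] * 10, [-1.1] * 10
long_pi, long_ref = [-1.0] * 100, [-1.1] * 100

print(log_ratio_product(short_pi, short_ref), log_ratio_product(long_pi, long_ref))
print(log_ratio_geometric_mean(short_pi, short_ref),
      log_ratio_geometric_mean(long_pi, long_ref))
print(turn_advantage(1.0, 0.25, 3))
```

With the product form, the longer response pays roughly ten times the penalty for the same per-token drift; with the geometric mean, both responses pay about the same.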
The execution feedback is the bridge between "the code failed" and "here's why." RLEF's power comes from teaching models to parse this bridge. Let's look at what the model actually sees.
| Error Type | What the Model Sees | What It Means |
|---|---|---|
| OUTPUT | Expected: "3 1 2", Got: "3 2 1" | Code runs but produces wrong answer |
| EXCEPTION | RuntimeError: list index out of range at line 14 | Code crashes with a traceback |
| TIMEOUT | Execution took too long (TLE) | Algorithm too slow for the input size |
| OOM | MemoryError: allocation exceeded limit | Algorithm uses too much memory |
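Each error type calls for a different kind of fix. Wrong output points to a logic bug: the code runs but computes the wrong thing. An exception points to an implementation bug, such as an out-of-range index. A timeout usually means the algorithm itself is too slow, so a line-level patch won't help and a fundamentally different approach is needed. OOM is similar: the data structure has to change.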
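The four categories in the table can be sketched as a small classifier over the outcome of one test run. Everything here is hypothetical scaffolding for illustration (`ErrorType`, `FIX_STRATEGY`, `classify`); the real execution environment and its message formats are not specified in this article.

```python
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    OUTPUT = "OUTPUT"
    EXCEPTION = "EXCEPTION"
    TIMEOUT = "TIMEOUT"
    OOM = "OOM"


# Fix strategies paraphrased from the paragraph above.
FIX_STRATEGY = {
    ErrorType.OUTPUT: "logic bug: the code runs but computes the wrong answer",
    ErrorType.EXCEPTION: "implementation bug: follow the traceback to the failing line",
    ErrorType.TIMEOUT: "algorithm too slow: switch to a different approach",
    ErrorType.OOM: "memory use too high: change the data structure",
}


def classify(exc: Optional[BaseException], timed_out: bool,
             got: Optional[str], expected: str) -> Optional[ErrorType]:
    """Map the outcome of one test run to an error type, or None if it passed."""
    if timed_out:
        return ErrorType.TIMEOUT
    if isinstance(exc, MemoryError):
        return ErrorType.OOM
    if exc is not None:
        return ErrorType.EXCEPTION
    if got != expected:
        return ErrorType.OUTPUT
    return None


kind = classify(IndexError("list index out of range"), False, None, "3 1 2")
print(kind, "->", FIX_STRATEGY[kind])
```

The ordering of the checks matters: a timed-out run has no meaningful output to compare, so it is classified before the output comparison.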
Figure 3 in the paper is striking. Before RLEF, looking at 5,640 rollouts (20 per problem):
To prove the model actually reads the feedback (rather than just using the retry opportunity to sample a different solution), the authors run an ablation: replace real execution feedback with feedback from executing code on an unrelated problem.
Result: with random feedback, error recovery drops dramatically. This confirms that RLEF-trained models are genuinely parsing and using the error information, not just treating the retry as another independent sampling opportunity.
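The ablation amounts to a one-line change in the feedback path. A minimal sketch, assuming feedback messages are stored per problem (the `swap_feedback` helper and the dictionary layout are hypothetical):

```python
import random
from typing import Dict


def swap_feedback(feedback_by_problem: Dict[str, str], problem_id: str,
                  rng: random.Random) -> str:
    """Return feedback that was produced for a different, unrelated problem."""
    others = [pid for pid in feedback_by_problem if pid != problem_id]
    return feedback_by_problem[rng.choice(others)]


feedback = {
    "problem_a": "TIMEOUT: execution took too long",
    "problem_b": "OUTPUT: expected '3 1 2', got '3 2 1'",
}
# The model working on problem_a is shown problem_b's feedback instead of its own.
print(swap_feedback(feedback, "problem_a", random.Random(0)))
```

If the model were only using the retry as a fresh sample, swapping the message would not matter; the reported drop in error recovery indicates that the content of the message does.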
RLEF is intentionally simple in its architecture. There's no special scaffolding, no chain-of-thought modules, no external test generators. Just a standard LLM backbone, a conversation format, and PPO.
The authors use Llama 3.1 Instruct models at 8B and 70B parameters. These are already instruction-tuned and capable of code generation — RLEF doesn't start from a raw pre-trained model. The instruction tuning provides the baseline conversational ability, and RLEF adds the multi-turn feedback-grounding capability on top.
| Parameter | 8B Model | 70B Model |
|---|---|---|
| PPO updates | 12,000 | 8,000 |
| Turn limit (training) | 3 | 3 |
| Training data | CodeContests training set (~12,659 problems) | Same as 8B |
| Checkpoint selection | Based on valid set performance | Same as 8B |
| Sampling temperature | 0.2 (1@3), 1.0 (10@100) | Same as 8B |
| KL penalty β | Tuned to balance task reward vs. distribution drift | Same as 8B |
The conversation format is straightforward. At each turn, the model's context window contains:
No retrieval augmentation. No chain-of-thought prompting. No external tool use beyond code execution. The simplicity is a feature: the model itself learns when and how to use feedback, rather than relying on scaffolding to structure the interaction.
Three specific choices worth noting:
RLEF achieves new state-of-the-art results on CodeContests, a challenging competitive programming benchmark, with both the 8B and 70B Llama models.
| Model (sample budget) | Valid solve rate (%) | Test solve rate (%) |
|---|---|---|
| AlphaCodium GPT-3.5 (5@100) | 25 | 17 |
| AlphaCodium GPT-4 (5@100) | 44 | 29 |
| MapCoder GPT-4 (1@19) | — | 28.5 |
| Llama 3.1 8B Instruct (1@3) | 8.9 | 10.5 |
| Llama 3.1 8B + RLEF (1@3) | 17.2 | 16.0 |
| Llama 3.1 70B Instruct (1@3) | 25.9 | 27.5 |
| Llama 3.1 70B + RLEF (1@3) | 37.5 | 40.1 |
With 100 samples (33 rollouts of 3 turns each), the improvements persist:
| Model | Valid 10@100 | Test 10@100 |
|---|---|---|
| AlphaCode 41B + clustering (10@1000) | 21.0 | 16.4 |
| Llama 3.1 70B Instruct | 50.2 | 50.3 |
| Llama 3.1 70B + RLEF | 54.5 | 54.5 |
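The budget arithmetic behind these tables is simple enough to spell out. The helper names below are hypothetical, and `solve_rate` is a plain percentage over problems, not the paper's exact n@k estimator.

```python
from typing import Sequence


def rollouts_for_budget(sample_budget: int, turn_limit: int) -> int:
    """How many full multi-turn rollouts fit into a budget counted in samples."""
    return sample_budget // turn_limit


def solve_rate(solved: Sequence[bool]) -> float:
    """Percentage of problems solved."""
    return 100.0 * sum(solved) / len(solved)


print(rollouts_for_budget(100, 3))          # rollouts available under a 100-sample budget
print(rollouts_for_budget(100, 3) * 3)      # samples actually used by those rollouts
print(solve_rate([True, False, True, False]))
```

Each turn of a rollout counts as one sample, which is why a 100-sample budget corresponds to 33 three-turn rollouts.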
RLEF improvements on CodeContests transfer to other code benchmarks:
| Model | HumanEval+ 1@3 | MBPP+ 1@3 |
|---|---|---|
| 70B Instruct (multi-turn) | 75.0 | 70.2 |
| 70B + RLEF (multi-turn) | 80.4 | 72.2 |
| GPT-4o (multi-turn) | 80.7 | 71.7 |
The results show RLEF works. But how does it work? Do the models genuinely use feedback, or have they just gotten better at sampling diverse code? The paper's analysis section answers this definitively.
Across 5,640 rollouts (20 per problem, valid + test), the authors count: how many errors occurred at turn 1, and how many were fixed by turns 2 and 3.
chrF measures character n-gram overlap between successive code solutions. A chrF of 1.0 means the code is identical.
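To make the metric concrete, here is a simplified chrF-style score in plain Python. It is a sketch under stated assumptions, not the reference implementation: it averages character n-gram precision and recall for n = 1..6 and combines them with an F-score weighted toward recall (β = 2), and it does not replicate details such as whitespace handling in established tools.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Multiset of character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: averaged n-gram precision/recall combined into an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


turn_1 = "for i in range(n): total += a[i]"
turn_2 = "for j in range(m): total += b[j]"
print(chrf(turn_1, turn_1))  # identical code
print(chrf(turn_2, turn_1))  # a small edit keeps the score high
```

A model that resubmits nearly the same code scores close to 1.0 between turns, while a model that rewrites its approach scores much lower.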
If the model truly uses feedback, performance should keep improving with more turns (up to a point). The authors test turn limits from 1 to 10:
Code similarity (chrF) between successive turns shows the contrast: base models barely change their code, while RLEF models make substantial rewrites.
The decisive test. Replace real feedback with feedback from executing code against an unrelated problem:
How does RLEF relate to the competitive landscape of code generation methods? Let's place it precisely.
AlphaCode (Google DeepMind) generates up to 1,000,000 samples, clusters them, and picks one per cluster for submission. AlphaCode 2 improved sample efficiency by 10,000x. RLEF's 70B model reaches 37.5 on the valid set at 1@3, which is above AlphaCode 2's estimated 34.2 (valid, 10@100); at a matched 10@100 budget, RLEF reaches 54.5.
AlphaCodium and MapCoder are agentic frameworks that chain many LLM calls: chain-of-thought planning, test generation, program repair, and so on. Both use GPT-4 as the backbone. RLEF's key advantage is that end-to-end training replaces all of this scaffolding: the model itself learns when to rewrite and when to patch.
| Method | Model | Budget | Test Solve Rate |
|---|---|---|---|
| AlphaCodium | GPT-4 | 5@100 | 29.0 |
| MapCoder | GPT-4 | 1@19 | 28.5 |
| RLEF | Llama 70B | 1@3 | 40.1 |
Before RLEF, independent sampling was the rational strategy. After RLEF, multi-turn refinement strictly dominates independent sampling at every budget level. The reversal is complete: the model now benefits from seeing its errors, which is exactly what you'd expect an effective coder to do.
The paper includes an ablation comparing RLEF to supervised fine-tuning on mined repair trajectories (good multi-turn rollouts extracted from the 70B model):
| Method | Valid 1@3 | Test 1@3 |
|---|---|---|
| Few-shot (8B) | 8.5 | 8.5 |
| SFT (8B) | 10.3 | 10.0 |
| RLEF (8B) | 17.2 | 16.0 |
| SFT (70B) | 27.7 | 27.2 |
| RLEF (70B) | 37.5 | 40.1 |
RLEF outperforms SFT by a large margin at both scales: on the test set, 16.0 vs. 10.0 for the 8B model and 40.1 vs. 27.2 for the 70B model. The RL reward provides a learning signal that imitation does not.
An alternative approach: train one model for generation (single-turn) and a separate model for repair. The paper tests this (Appendix B.3): the two-model system achieves 14.8 valid / 12.6 test (8B), well below the single RLEF model at 17.2 / 16.0. End-to-end training beats the modular approach.
RLEF connects several research threads. Let's map where it fits.
RLHF (Reinforcement Learning from Human Feedback) uses a reward model trained on human preferences. RLEF uses execution results, an automatic reward signal. This is significant: given a test suite, code execution yields an unambiguous pass/fail outcome, so there is no reward model to train and no learned reward for the policy to exploit. The code either passes the tests or it doesn't.
Concurrent work by Kumar et al. (2024) proposes SCoRe, a two-stage RL method for self-correction that doesn't use execution feedback — it just asks the model to "reconsider." RLEF's advantage: the execution feedback provides concrete, actionable information about what went wrong. "Reconsider your answer" is weaker than "your code timed out on this input."
Prior RL-for-code work (Le et al., 2022; Dou et al., 2024) used execution rewards for single-turn generation. RLEF extends this to the multi-turn setting, which is both more natural (humans iterate on code) and more effective (the model can fix specific errors).
RLEF can be viewed as training an agent that interacts with a code execution environment. The key insight: rather than building scaffolding that structures the interaction (like AlphaCodium), RLEF trains the model to structure its own interaction. This "learned scaffolding" is more flexible and more sample-efficient.
| Aspect | RLEF |
|---|---|
| What it teaches | Reading execution feedback and fixing code iteratively |
| RL algorithm | PPO with KL penalty, binary reward from private tests |
| Training turns | 3 (generalizes to 5 at inference) |
| 8B improvement | 8.9 → 17.2 (valid), 10.5 → 16.0 (test) |
| 70B improvement | 25.9 → 37.5 (valid), 27.5 → 40.1 (test) |
| vs. AlphaCodium GPT-4 | 40.1 with 3 samples vs. 29 with 100 |
| Sample reduction | ~10x or more vs. prior SOTA |
| Key evidence | Random feedback ablation proves genuine feedback use |
| Generalization | HumanEval+, MBPP+, and more turns than training |