RLEF — Veanors

Chapter 0: The Problem

You give a code LLM a competitive programming problem. It writes a solution. The solution fails some test cases. You paste the error message back into the conversation. The model tries again.

Intuitively, this should work. The model saw the error. It knows what went wrong. It should be able to fix it. But here's the uncomfortable truth: state-of-the-art code LLMs are worse at iterative refinement than at simply generating independent solutions from scratch.

The refinement paradox: If you have a budget of 3 LLM calls, you're better off asking the model for 3 independent solutions than asking it to try once, read the error, and try twice more. The model that "sees" the error performs worse than the model that's blind to it. This holds for GPT-4, Llama 3.1, and essentially every instruction-tuned LLM tested.

Why? Because instruction-tuned models weren't trained to use execution feedback. They were trained to follow instructions and produce plausible text. When they see an error message, they often make minimal changes — sometimes literally copying the same code again — or make random edits unrelated to the actual error. They don't know how to parse "RuntimeError: list index out of range" and trace it back to the specific off-by-one bug in their loop.

Olausson et al. (2024) showed that large models are needed just to provide useful error analysis, and that multiple rounds of repair don't help. Kapoor et al. (2024) demonstrated that independent sampling beats self-repair when you account for the compute budget. The entire iterative refinement story was, practically speaking, broken.

Refinement vs Independent Sampling

Click "Run Experiment" to see how a base LLM performs when sampling independently vs. refining iteratively on the same problem. The refinement curve flatlines — the model doesn't improve from seeing errors.

Why are current code LLMs bad at iterative refinement despite seeing error messages?

They were trained on instruction-following, not on using execution feedback — so they make minimal or random edits instead of targeted fixes Error messages are too short to be useful The models can't understand Python syntax

Chapter 1: The Key Insight

If models can't use execution feedback because they weren't trained to, then train them to. Not with supervised examples of "here's an error, here's the fix." With reinforcement learning, where the only reward signal is: did your final code pass the tests?

This is the central idea of RLEF (Reinforcement Learning from Execution Feedback). Frame the entire multi-turn code generation process — write code, execute, read error, rewrite, execute again — as a single RL episode. The model's policy is its code generation behavior across all turns. The reward comes at the end: binary pass/fail on held-out test cases.

Why RL and not supervised fine-tuning? SFT requires ground-truth examples of good repair trajectories. Where do you get them? You'd need to mine successful multi-turn conversations from the model itself — but the model is bad at repair, so good examples are rare. RL sidesteps this by learning from the reward signal directly. The model discovers repair strategies through trial and error, guided by whether its final code actually works.

The paper shows this concretely: SFT on mined repair trajectories improves validation performance slightly but doesn't generalize to the test set. Few-shot prompting with repair examples actually hurts instruction-tuned models. Only RLEF produces robust, generalizable improvement.

The Two Skills RLEF Teaches

Skill 1: Task Alignment

Better first-attempt code generation. The model learns the specific domain (competitive programming) during RL, producing fewer errors on the initial try.

Skill 2: Feedback Grounding

Reading execution feedback and making targeted fixes. The model learns to parse error messages, identify root causes, and rewrite the faulty code — not just resample randomly.

Both skills matter. RLEF-trained models make fewer errors on turn 1 (task alignment) AND fix more errors on turns 2-3 (feedback grounding). The combination is what produces the 10x sample reduction over prior methods.

Why does RLEF use RL instead of supervised fine-tuning for teaching iterative repair?

Because RL is always better than SFT Because good repair trajectories are rare (the model is bad at repair), making SFT data scarce, while RL learns from the pass/fail reward signal directly without needing ground-truth demonstrations Because SFT requires more compute

Chapter 2: The Multi-Turn Loop

Let's walk through exactly what happens during one RLEF episode. The model gets a competitive programming problem and interacts with an execution environment over multiple turns.

Step by Step

1. Problem Prompt

The model receives a natural language problem description with example test cases (public tests). "Let's call a string beautiful if it does not contain a palindromic substring of length ≥ 2..."

↓

2. Code Generation (Turn 1)

The model generates a Python solution. This is its first attempt — cold start, no prior feedback.

↓

3. Public Test Execution

The generated code runs against public test cases. Three outcomes: all pass, some fail (with error details), or the code is syntactically invalid.

↓

4a. If FAIL: Feedback → Retry

The execution result (error type, failed inputs/outputs, exception traceback) is formatted as text and appended to the conversation. The model generates a new solution with the full history visible.

↓

4b. If PASS or Turn Limit

The current solution is submitted. It's evaluated against private test cases (the ones the model never sees). Binary reward: +1 if all private tests pass, -1 otherwise.

Two test sets, different roles. Public tests provide feedback (the model sees them). Private tests provide the reward (the model never sees them). This separation prevents the model from learning shortcuts like hard-coding expected outputs. It also means passing public tests doesn't guarantee correctness — the model must generalize.

A Concrete Example

From the paper's Figure 2:

Turn 1: Naive Solution

def min_cost_to_beautiful(s): while True: found = False; for length in range(len(s), 1, -1): ...
Result: "Execution took too long" — the brute-force approach times out on large inputs.

↓ feedback: TLE on input '5A baacb 13 15 23'

Turn 2: Optimized Solution

from functools import lru_cache; def is_beautiful(s): ...
The model reads "too long" and responds with memoization. Result: all public tests pass → submit.

The key observation: this error-to-fix pattern only emerges after RLEF training. Before RLEF, the model would typically output nearly identical code on turn 2, ignoring the timeout error.

Why does RLEF use separate public and private test sets?

To reduce computational cost during training Because private tests are harder than public tests Public tests give feedback for the model to iterate on, while private tests provide the reward signal — this prevents shortcutting by hard-coding expected outputs

Chapter 3: RLEF Training (Showcase)

Now let's look at the actual RL formulation. This is where RLEF gets interesting — the multi-turn code generation process is cast as a Markov Decision Process, and the model is optimized with PPO.

The MDP Formulation

The iterative code generation loop maps directly onto an MDP:

Observation o₀: The problem description (natural language + public test examples)
Action a_t: The model's textual response at turn t (containing code)
Observation o_t: All previous observations + actions + execution feedback from evaluating a_t-1
Episode end: Public tests pass, or turn limit reached
Reward: Binary, based on private test execution of the final solution

Partial observability. This is technically a POMDP — the reward depends on private tests the model can't see. The model observes only public test feedback but must produce code that generalizes to the full private test suite.

The Reward Function

At step t, with context c_t = (o₀, a₀, o₁, a₁, ..., o_t):

R(s_t, a_t) = r(s_t, a_t) − β log π(a_t|c_t) / ρ(a_t|c_t)

Where the task reward r is:

r(s_t, a_t) = +1 if episode ends and all tests pass
r(s_t, a_t) = −1 if episode ends and any test fails
r(s_t, a_t) = −0.2 if a_t does not contain valid code

The second term is a KL penalty that keeps the trained policy π from drifting too far from the initial policy ρ (the pre-trained Llama model). This serves as both entropy regularization and a guard against catastrophic forgetting. The constant β trades off task reward against staying close to the original model.

The −0.2 penalty for invalid code addresses a failure mode the authors discovered: without it, the model sometimes generates garbage text in non-final turns (since only the final turn's code matters for the reward). The small penalty encourages valid code at every step, which preserves the conversational structure needed for effective multi-turn interaction.

Token-Level Policy, Turn-Level Value

A subtle design choice: the policy operates at the token level (the LLM generates tokens one by one), but the value function operates at the turn level (it predicts the expected return from the last token of each prompt, before the model responds).

This hybrid approach outperformed both fully token-level and fully turn-level alternatives in ablations. The advantage for all tokens within a response is a single value:

A_t = −V(c_t) + Σ_i=t^T R(s_i, a_i)

And the KL penalty uses the geometric mean of token probabilities rather than the product. This prevents a bias toward shorter generations — important because intermediate (non-final) turns need full, detailed code, not abbreviated responses.

RLEF Training Episode

Step through a training episode. Watch the model generate code, receive feedback, regenerate, and receive a reward. The reward propagates back through all turns via PPO.

Click to start episode

Why does RLEF use token-level policy but turn-level value function?

This hybrid outperformed both fully token-level and fully turn-level alternatives — the policy needs fine-grained token control, while the value function naturally evaluates whole turns as the minimal unit of interaction with the execution environment Because turn-level policies are impossible to implement To reduce memory usage during training

Chapter 4: Execution Feedback

The execution feedback is the bridge between "the code failed" and "here's why." RLEF's power comes from teaching models to parse this bridge. Let's look at what the model actually sees.

Types of Feedback

Error Type	What the Model Sees	What It Means
OUTPUT	Expected: "3 1 2", Got: "3 2 1"	Code runs but produces wrong answer
EXCEPTION	RuntimeError: list index out of range at line 14	Code crashes with a traceback
TIMEOUT	Execution took too long (TLE)	Algorithm too slow for the input size
OOM	MemoryError: allocation exceeded limit	Algorithm uses too much memory

Each error type requires a different kind of fix. Wrong output means a logic bug. Exceptions mean a code bug. Timeouts mean the algorithm is wrong — you need a fundamentally different approach, not just a line fix. OOM is similar: the data structure is wrong.

The feedback template is structured. It includes: which test cases failed, what input was given, what output was expected vs. produced, and any exception traceback. This is formatted as natural language and inserted into the conversation. The model sees the full history: problem description, turn 1 code, turn 1 feedback, and must produce turn 2 code.

What Changes After RLEF

Figure 3 in the paper is striking. Before RLEF, looking at 5,640 rollouts (20 per problem):

8B base model: ~3000 errors on turn 1. Of those, only ~100 are fixed on turn 2. The model barely repairs anything.
8B + RLEF: ~2500 errors on turn 1 (fewer initial errors). Of those, ~250 are fixed on turn 2 and ~100 more on turn 3. Real, targeted repair.
70B + RLEF: Even more dramatic improvement in error recovery across all error types.

The chrF evidence is damning. chrF measures character-level similarity between successive code outputs. Base models have chrF close to 1.0 — meaning they literally copy-paste their previous code with minimal changes. RLEF models have much lower chrF, indicating substantial code rewrites between turns. They're not making cosmetic tweaks; they're genuinely rewriting the solution.

Random Feedback Ablation

To prove the model actually reads the feedback (rather than just using the retry opportunity to sample a different solution), the authors run an ablation: replace real execution feedback with feedback from executing code on an unrelated problem.

Result: with random feedback, error recovery drops dramatically. This confirms that RLEF-trained models are genuinely parsing and using the error information, not just treating the retry as another independent sampling opportunity.

How did the authors prove that RLEF models actually read the execution feedback rather than just sampling randomly?

They checked the model's attention weights They removed the feedback entirely They replaced real feedback with feedback from unrelated problems — error recovery dropped sharply, proving the model depends on the actual content of the feedback

Chapter 5: Architecture

RLEF is intentionally simple in its architecture. There's no special scaffolding, no chain-of-thought modules, no external test generators. Just a standard LLM backbone, a conversation format, and PPO.

The LLM Backbone

The authors use Llama 3.1 Instruct models at 8B and 70B parameters. These are already instruction-tuned and capable of code generation — RLEF doesn't start from a raw pre-trained model. The instruction tuning provides the baseline conversational ability, and RLEF adds the multi-turn feedback-grounding capability on top.

No code-specific model needed. RLEF works on general instruction-tuned models. The authors also test on the older Llama 3.0 8B and get substantial improvements (4.1 → 12.5 on valid set 1@3), suggesting RLEF can partially substitute for instruction tuning on code tasks.

Training Setup

Parameter	8B Model	70B Model
PPO updates	12,000	8,000
Turn limit (training)	3	3
Training data	CodeContests training set (~12,659 problems)
Checkpoint selection	Based on valid set performance
Sampling temperature	0.2 (1@3), 1.0 (10@100)
KL penalty β	Tuned to balance task reward vs. distribution drift

Multi-Turn Context

The conversation format is straightforward. At each turn, the model's context window contains:

Position 1

Original problem description + public test examples

↓

Position 2

Model's turn-1 code solution

↓

Position 3

Execution feedback from turn 1 (formatted error messages)

↓

Position 4

Model's turn-2 code solution (to be generated)

No retrieval augmentation. No chain-of-thought prompting. No external tool use beyond code execution. The simplicity is a feature: the model itself learns when and how to use feedback, rather than relying on scaffolding to structure the interaction.

Reward Design Choices

Three specific choices worth noting:

Binary reward (+1/-1): No partial credit for passing some tests. This is harsh but clear — it forces the model to aim for full correctness, not just partial solutions.
No reward discounting (γ = 1): All turns contribute equally to the reward. Turn 1 matters as much as turn 3. This encourages the model to produce good code from the start while also learning to repair.
Geometric mean KL: Using the geometric mean of token probabilities for the KL penalty prevents bias toward shorter responses, keeping intermediate turns at full code length.

What is the primary advantage of RLEF's architectural simplicity (no scaffolding, no external tools)?

The model itself learns when and how to use feedback — this capability is embedded in the weights rather than hard-coded in external scaffolding, making it more robust and generalizable Simpler architectures are always faster External tools are not available for competitive programming

Chapter 6: Results

RLEF achieves new state-of-the-art results on CodeContests, a challenging competitive programming benchmark, with both the 8B and 70B Llama models.

CodeContests Results (1@3: one rollout, 3 turns)

Model	Valid	Test
AlphaCodium GPT-3.5 (5@100)	25	17
AlphaCodium GPT-4 (5@100)	44	29
MapCoder GPT-4 (1@19)	—	28.5
Llama 3.1 8B Instruct (1@3)	8.9	10.5
Llama 3.1 8B + RLEF (1@3)	17.2	16.0
Llama 3.1 70B Instruct (1@3)	25.9	27.5
Llama 3.1 70B + RLEF (1@3)	37.5	40.1

The 70B RLEF model with 3 samples beats AlphaCodium GPT-4 with 100 samples on the test set (40.1 vs 29). That's a 10x reduction in sample budget to achieve superior performance. The 8B RLEF model (16.0 test, 3 samples) is competitive with AlphaCode 9B (13.3 test, 1000 samples).

Larger Budget: 10@100

With 100 samples (33 rollouts of 3 turns each), the improvements persist:

Model	Valid 10@100	Test 10@100
AlphaCode 41B + clustering (10@1000)	21.0	16.4
Llama 3.1 70B Instruct	50.2	50.3
Llama 3.1 70B + RLEF	54.5	54.5

Generalization to Other Benchmarks

RLEF improvements on CodeContests transfer to other code benchmarks:

Model	HumanEval+ 1@3	MBPP+ 1@3
70B Instruct (multi-turn)	75.0	70.2
70B + RLEF (multi-turn)	80.4	72.2
GPT-4o (multi-turn)	80.7	71.7

The 70B RLEF model matches GPT-4o on HumanEval+ and MBPP+ in the multi-turn setting. And this is with a model fine-tuned only on CodeContests — the feedback-grounding capability generalizes across benchmarks with different problems and different feedback formats.

RLEF Results: Sample Efficiency

Compare solve rates across different sample budgets. RLEF models achieve higher accuracy with far fewer samples.

What is the most surprising result from the RLEF experiments?

The 70B RLEF model with 3 samples outperforms AlphaCodium GPT-4 with 100 samples — a 10x+ reduction in sample budget with superior accuracy The 8B model improves more than the 70B model RLEF only works on CodeContests

Chapter 7: Behavior Analysis

The results show RLEF works. But how does it work? Do the models genuinely use feedback, or have they just gotten better at sampling diverse code? The paper's analysis section answers this definitively.

Evidence 1: Error Recovery Rates

Across 5,640 rollouts (20 per problem, valid + test), the authors count: how many errors occurred at turn 1, and how many were fixed by turns 2 and 3.

8B Instruct: ~3000 turn-1 errors. Only ~100 fixed by turn 2. Almost zero improvement from feedback.
8B + RLEF: ~2500 turn-1 errors (fewer to start). ~250 fixed by turn 2, ~100 more by turn 3. The model repairs across all error types: OUTPUT, EXCEPTION, TIMEOUT, and OOM.
70B + RLEF: Even larger recovery rates, especially for EXCEPTION and OUTPUT errors.

RLEF doesn't just fix easy bugs. The model learns to handle TIMEOUT errors — which require fundamentally changing the algorithm, not just patching a line. This is strong evidence that RLEF teaches genuine problem-solving, not template matching.

Evidence 2: Code Diversity (chrF)

chrF measures character n-gram overlap between successive code solutions. A chrF of 1.0 means the code is identical.

Base Instruct models: chrF concentrated near 1.0 — the model copies its code with minimal changes, ignoring the feedback.
RLEF models: chrF distribution shifts dramatically toward lower values (0.2-0.8), indicating substantial code rewrites. The model is genuinely restructuring its solution.

Evidence 3: Turn Limit Scaling

If the model truly uses feedback, performance should keep improving with more turns (up to a point). The authors test turn limits from 1 to 10:

Base models: Best performance at 1 turn (independent sampling). More turns don't help or hurt.
RLEF models: Performance improves consistently up to 5 turns. 3 and 5 turns strictly dominate 1 turn. Beyond 5, returns diminish (10 turns provides no further benefit).

5 turns is the sweet spot. RLEF-trained models can effectively leverage up to 5 rounds of feedback, far beyond the 3 turns used during training. This generalization to more turns than seen in training is remarkable — the model has learned a general repair skill, not just a 3-turn pattern.

Do Models Actually Use Feedback?

Compare code similarity (chrF) between successive turns. Base models barely change their code. RLEF models make substantial rewrites. Click to toggle.

Evidence 4: Random Feedback Ablation

The decisive test. Replace real feedback with feedback from executing code against an unrelated problem:

pass@1 (did the final solution pass?): Drops significantly with random feedback, especially at higher turn limits. The model can't reliably converge on the right fix when the feedback is misleading.
pass@10 (did any of 10 solutions pass?): Drops less, because even with bad feedback, sampling diversity can find a solution. But the drop in pass@1 confirms the model uses feedback for targeted repair, not just diversity.

What does the chrF analysis reveal about RLEF-trained models vs base models?

RLEF models produce shorter code Base models barely change code between turns (chrF near 1.0, essentially copy-pasting), while RLEF models make substantial rewrites (chrF 0.2-0.8), indicating they genuinely restructure their solutions based on feedback Both models change code equally but RLEF changes are higher quality

Chapter 8: Comparisons

How does RLEF relate to the competitive landscape of code generation methods? Let's place it precisely.

vs. AlphaCode / AlphaCode 2

AlphaCode (Google DeepMind) generates up to 1,000,000 samples, clusters them, and picks one per cluster for submission. AlphaCode 2 improved this 10,000x in sample efficiency. RLEF's 70B model (37.5 valid, 10@100) beats AlphaCode 2's estimated performance (34.2 valid, 10@100) — and RLEF uses only 100 samples vs. potentially thousands for AlphaCode 2.

vs. AlphaCodium / MapCoder

These are agentic frameworks that chain many LLM calls: chain-of-thought planning, test generation, program repair, etc. They use GPT-4 as the backbone. RLEF's key advantage: all the scaffolding is replaced by end-to-end training. The model learns when to rewrite vs. patch, without explicit scaffolding.

Method	Model	Budget	Test Solve Rate
AlphaCodium	GPT-4	5@100	29.0
MapCoder	GPT-4	1@19	28.5
RLEF	Llama 70B	1@3	40.1

RLEF with 3 samples > AlphaCodium/MapCoder with 19-100 samples. And RLEF uses an open model (Llama 70B) vs. proprietary GPT-4. The trained model internalizes the scaffolding that these frameworks hard-code externally.

vs. Independent Sampling

Before RLEF, independent sampling was the rational strategy. After RLEF, multi-turn refinement strictly dominates independent sampling at every budget level. The reversal is complete: the model now benefits from seeing its errors, which is exactly what you'd expect an effective coder to do.

vs. SFT on Repair Trajectories

The paper includes an ablation comparing RLEF to supervised fine-tuning on mined repair trajectories (good multi-turn rollouts extracted from the 70B model):

Method	Valid 1@3	Test 1@3
Few-shot (8B)	8.5	8.5
SFT (8B)	10.3	10.0
RLEF (8B)	17.2	16.0
SFT (70B)	27.7	27.2
RLEF (70B)	37.5	40.1

RLEF outperforms SFT by a large margin at both scales. The RL reward signal provides learning signal that imitation cannot.

vs. Dedicated Repair Model

An alternative approach: train one model for generation (single-turn) and a separate model for repair. The paper tests this (Appendix B.3): the two-model system achieves 14.8 valid / 12.6 test (8B), well below the single RLEF model at 17.2 / 16.0. End-to-end training beats the modular approach.

Why does RLEF outperform agentic frameworks like AlphaCodium despite using far fewer samples and no scaffolding?

Because Llama 70B is a better model than GPT-4 Because competitive programming is too easy for scaffolding to help End-to-end RL training internalizes feedback-grounding into the model weights, replacing manually designed scaffolding with learned behavior that's more efficient per sample

Chapter 9: Connections

RLEF connects several research threads. Let's map where it fits.

Relation to RLHF

RLHF (Reinforcement Learning from Human Feedback) uses a reward model trained on human preferences. RLEF uses execution results — an automatic reward signal. This is significant: code execution provides a perfect, zero-noise reward. You never need to train a reward model or worry about reward hacking. The code either passes the tests or it doesn't.

Relation to Self-Correction (SCoRe)

Concurrent work by Kumar et al. (2024) proposes SCoRe, a two-stage RL method for self-correction that doesn't use execution feedback — it just asks the model to "reconsider." RLEF's advantage: the execution feedback provides concrete, actionable information about what went wrong. "Reconsider your answer" is weaker than "your code timed out on this input."

Relation to CodeRL / StepCoder

Prior RL-for-code work (Le et al., 2022; Dou et al., 2024) used execution rewards for single-turn generation. RLEF extends this to the multi-turn setting, which is both more natural (humans iterate on code) and more effective (the model can fix specific errors).

Relation to Agent Architectures

RLEF can be viewed as training an agent that interacts with a code execution environment. The key insight: rather than building scaffolding that structures the interaction (like AlphaCodium), RLEF trains the model to structure its own interaction. This "learned scaffolding" is more flexible and more sample-efficient.

Limitations and Open Questions

Single-file problems: RLEF works on competitive programming (single-solution problems). Scaling to multi-file software engineering (SWE-bench) requires decomposition capabilities.
Requires test cases: The reward comes from test execution. Domains without automatic evaluation can't use RLEF directly — though combining with automatic test generation is a natural extension.
Reduced diversity: RL training reduces output diversity, which explains why improvements are larger at small budgets (1@3) than large ones (10@100).

Cheat Sheet

Aspect	RLEF
What it teaches	Reading execution feedback and fixing code iteratively
RL algorithm	PPO with KL penalty, binary reward from private tests
Training turns	3 (generalizes to 5 at inference)
8B improvement	8.9 → 17.2 (valid), 10.5 → 16.0 (test)
70B improvement	25.9 → 37.5 (valid), 27.5 → 40.1 (test)
vs. AlphaCodium GPT-4	40.1 with 3 samples vs. 29 with 100
Sample reduction	~10x or more vs. prior SOTA
Key evidence	Random feedback ablation proves genuine feedback use
Generalization	HumanEval+, MBPP+, and more turns than training

The broader lesson: When automatic evaluation is available, end-to-end RL beats both scaffolding and supervised fine-tuning. The model should learn to use feedback, not have feedback use scripted for it. RLEF is a concrete demonstration that the "iterative refinement" workflow — write, test, debug, repeat — can be taught rather than hand-coded.

What is the key advantage of RLEF's execution-based reward over RLHF's human-preference reward for code generation?

Execution provides a perfect, zero-noise reward signal without needing a learned reward model — the code either passes or it doesn't, eliminating reward hacking RLEF requires less compute than RLHF Human preferences are always wrong for code

RLEF: Grounding Code LLMs in Execution Feedback