AlphaCode — Veanors

Chapter 0: The Problem

Imagine sitting down for a Codeforces contest. You are given a multi-paragraph problem statement in English, often wrapped in a narrative ("Polycarp has n boxes..."). You must read it carefully, understand the hidden algorithmic structure, choose the right data structure, implement a correct solution in C++ or Python, and submit it within a few hours. A single wrong edge case, and the judge rejects your code.

This is competitive programming — and it is one of the hardest benchmarks for code generation. Unlike translating a docstring into a short function (HumanEval), or filling in a missing API call, competitive programming requires:

Deep reading comprehension — parsing long, deliberately tricky problem statements
Algorithmic reasoning — choosing between greedy, DP, graph search, number theory, and more
Precise implementation — solutions can be 100+ lines of careful, edge-case-handling code
Correctness under hidden tests — the judge runs your code against inputs you never see

In early 2022, large language models like Codex could complete simple programming tasks — writing a function to reverse a string, or calling the right library. But when evaluated on actual competition problems, even the best models solved fewer than 5% of problems.

The core challenge: Generating code that solves a specified task requires searching a vast, structured space of possible programs, with an extremely sparse reward signal. A single character edit can completely change program behavior. Solutions can look dramatically different even for the same problem. And the test cases that determine correctness are hidden.

AlphaCode was DeepMind's answer to this challenge. In simulated evaluations on 10 recent Codeforces contests (each with 5,000+ participants), AlphaCode achieved an average ranking in the top 54.3% — an estimated Codeforces rating of 1238. This was the first time an AI system reached competitive programmer level on this widely-recognized benchmark.

What makes competitive programming fundamentally harder than simple code completion tasks like HumanEval?

It requires deep algorithmic reasoning from complex natural language descriptions, with correctness checked against exhaustive hidden test cases — not just translating a specification into a short function The code needs to be written in C++ instead of Python The problem statements are longer

Chapter 1: The Key Insight

If you ask a language model to write one solution to a hard competitive programming problem, it will almost certainly be wrong. The pass rate for any individual sample is tiny — often below 1%.

But here is the insight that makes AlphaCode work: if you generate enough samples, some of them will be correct. And if you have a way to identify the correct ones, you win.

Think of it like a lottery. Each individual ticket has a very low chance of winning. But if you buy a million tickets, your odds become quite good. The trick is that in programming, unlike in a lottery, you have a powerful verification tool: you can run the code.

Generate massively, then filter. AlphaCode generates roughly 1,000,000 candidate solutions per problem. It filters them down to about 10 using automated testing. This is a fundamentally different strategy from trying to produce one perfect answer. It decouples generation quality from selection quality, and bets on scale for the former and program execution for the latter.

This approach has three critical requirements:

1. A model that puts some probability mass on correct solutions

Even if p(correct) is 0.001%, with 10⁶ samples you expect ~10 correct ones

↓

2. Efficient large-scale sampling

Generating 10⁶ programs must be fast enough to be practical

↓

3. A filtering pipeline that picks winners

From 10⁶ down to 10 submissions, without access to hidden tests

The paper shows that solve rate scales log-linearly with the number of samples. Double the sample budget, and you get a roughly constant improvement in solve rate. This scaling holds from 10 samples all the way to 1,000,000.

Approach	Strategy	Solve Rate on CodeContests
Codex 12B	1 sample, no filtering	~1%
AlphaCode 1B	1k samples, filter to 10	~12%
AlphaCode 41B	1M samples, filter + cluster to 10	~34%

Better models also have steeper scaling slopes — each doubling of samples yields a bigger improvement for larger models. This means model quality and sample quantity are complementary, not substitutes.

Why does AlphaCode's solve rate scale log-linearly with the number of samples?

Each sample is an independent draw from a distribution that places small but nonzero probability on correct solutions, so more draws exponentially increase the chance that at least one is correct The model learns from its previous samples More samples mean more training data

Chapter 2: The CodeContests Dataset

A good benchmark for code generation needs to answer one question reliably: did this program actually solve the problem? Surprisingly, existing datasets were terrible at this.

The false positive problem

When you check a program by running it against test cases, insufficient test coverage means some wrong programs pass all tests. These are false positives. The paper found alarming rates in existing benchmarks:

Dataset	Tests / Problem	False Positive Rate
APPS	21	60%
HumanEval	7.8	30%
CodeContests (raw)	12.4	62%
CodeContests (final)	203.7	4%

If 60% of "correct" solutions are actually wrong, the benchmark is useless. You cannot measure progress if your ruler is broken.

How CodeContests fixes this

The authors built CodeContests by combining scraped Codeforces data with Description2Code and CodeNet. Three design choices reduced the false positive rate from 62% to 4%:

1. Generated test cases

Mutate existing inputs (bit flips, increment/decrement integers, swap characters). Verify mutations by running 30 known-correct solutions and keeping only inputs where all agree on the output. Up to 200 generated tests per problem.

↓

2. Filter out weak problems

Keep only problems with ≥5 hidden/generated tests that produce ≥2 different outputs. This prevents a model from trivially solving problems by always outputting "YES".

↓

3. Strict temporal split

All training data appeared online before 2021/07/14. Validation problems: 2021/07/15–09/20. Test problems: after 2021/09/21. No data leakage possible.

Why the temporal split matters: Competitive programmers routinely study past contest solutions. A temporal split mirrors this reality — the model can learn from past problems, but the evaluation problems are genuinely new. This is stricter than random splits, which risk information leakage through similar problems.

Dataset statistics

Split	Problems	Example Tests	Hidden Tests	Generated Tests
Train	13,328	2.0 avg	14.8 avg	79.1 avg
Valid	117	1.5 avg	12.9 avg	190.0 avg
Test	165	1.7 avg	9.4 avg	192.7 avg

Each problem also includes correct and incorrect human submissions in C++, Python, and Java — hundreds of solutions per problem. These are essential for training, because the model needs to see both what works and what does not.

What a problem looks like

Here is a real Codeforces problem that AlphaCode solved (rated 1500, medium difficulty):

# Problem: "Backspace"
# You type string s character by character.
# Instead of typing a char, you can press Backspace,
# which deletes the last typed character.
# Can you obtain string t from s?
# Example: s="abcbd", press backspace on 1st and 4th chars → "bd"

# AlphaCode's generated solution (Python):
t = int(input())
for i in range(t):
  s = list(reversed(input()))
  b = list(reversed(input()))
  c = []
  while b and s:
    if s[0] == b[0]:
      c.append(b.pop(0))
      s.pop(0)
    elif len(s) != 1:
      s.pop(0); s.pop(0)  # backspace deletes 2
    else: s.pop(0)
  print("YES" if not b else "NO")

Notice what the model did: it reversed both strings and matched from the end, correctly handling the backspace-deletes-two-characters logic. This is not copying — the analysis found no matching logic in the training set. It is genuine algorithmic reasoning.

How did the authors reduce the false positive rate from 62% to 4%?

By generating additional test cases through input mutation (verified against 30 correct solutions) and filtering out problems with insufficient test coverage By using more powerful language models to check solutions By only including easier problems in the dataset

Chapter 3: The Architecture

Competitive programming is a sequence-to-sequence task: given a problem description X (natural language), produce a solution Y (code). This naturally motivates an encoder-decoder transformer.

Why encoder-decoder, not decoder-only?

Two reasons. First, the encoder can use bidirectional attention over the problem description — tokens at the beginning can attend to tokens at the end, which matters for understanding long, complex problem statements. Decoder-only models only see left context.

Second, the architecture allows an asymmetric design. Problem descriptions are typically twice as long as solutions (average ~1,500 vs. ~750 tokens). AlphaCode uses 1,536 tokens for the encoder and only 768 for the decoder. This asymmetry is critical for sampling efficiency.

Shallow encoder, deep decoder

An even more surprising design choice: the encoder is shallow (e.g. 5 blocks for the 1B model) and the decoder is deep (30 blocks). The encoder only needs to build a good representation of the problem. The decoder does the hard work of actually generating code token by token. This split improves training efficiency without hurting solve rate.

Model	Params	d_model	Enc Blocks	Dec Blocks	Samples/TPU-sec
300M	284M	768	4	24	—
1B	1.1B	1408	5	30	4.74
3B	2.8B	2048	6	36	—
9B	8.7B	3072	8	48	—
41B	41.1B	6144	8	56	—

Multi-query attention for fast sampling

Since AlphaCode needs to generate millions of samples, sampling speed is paramount. The model uses multi-query attention (Shazeer, 2019): full query heads but shared key and value heads per block. This drastically reduces memory usage and KV-cache update costs during autoregressive decoding, enabling larger batch sizes and much faster throughput.

The impact is dramatic. At 1B parameters, the encoder-decoder with multi-query attention generates 4.74 samples per TPU-second, compared to 1.23 for a decoder-only model and just 0.37 for standard multi-head attention. That is a 12.8x speedup — the difference between generating a million samples in hours versus days.

Training recipe

Training happens in two phases:

Pre-training on GitHub

715 GB of code across 12 languages. Encoder: masked language modeling. Decoder: next-token prediction. Split files at random pivot points.

↓

Fine-tuning on CodeContests

Problem descriptions → encoder, solutions → decoder. Uses tempering (T=0.2), value conditioning (correct/incorrect labels), and GOLD (offline RL for one-of-many tasks).

Value conditioning

CodeContests contains both correct and incorrect solutions. Rather than discarding the incorrect ones, AlphaCode uses value conditioning: a tag inserted into the problem description tells the model whether the solution being generated is correct or incorrect. At training time, the model sees both. At sampling time, it is always conditioned on "CORRECT SOLUTION".

# What the encoder sees (prepended to the problem):
RATING : 1200
TAGS : dp, implementation
LANGUAGE IS python3
CORRECT SOLUTION
# ... followed by the full problem description

This is a remarkably simple trick. By telling the model "this is a correct solution," you bias the distribution toward programs that are actually right. The model also has an auxiliary value prediction head that classifies whether a solution is correct, providing an additional training signal.

Tempering is counterintuitive: Training with temperature T=0.2 makes the training distribution sharper, which paradoxically makes the inference distribution smoother. This prevents overfitting to the small fine-tuning dataset and preserves sample diversity at generation time. It is the opposite of the original suggestion in the tempering paper.

GOLD: training for one-of-many

Standard maximum likelihood training tries to put probability on every correct solution in the training set (like recall). But AlphaCode's metric only requires finding one correct solution (like precision). These objectives conflict.

GOLD (Pang and He, 2020) is an offline RL technique that resolves this. It lets the model concentrate probability mass on solutions it already assigns high likelihood to, and ignore solutions that are not in its distribution. Think of it as saying: "Don't try to learn every possible way to solve this problem. Master the approaches that come naturally to you."

GOLD improves the 10@100k solve rate from 20.2% to 21.5% on the 1B model — and the benefit compounds with larger sample budgets because the sharpened distribution produces more concentrated, higher-quality samples.

Why does AlphaCode use a shallow encoder and deep decoder?

The encoder only needs to build a representation of the problem statement, while the decoder does the harder work of generating code token-by-token — this split improves training efficiency without hurting solve rate Deeper encoders cause overfitting It reduces the total number of parameters

Chapter 4: Massive Sampling

This is the heart of AlphaCode. For each problem, the system generates roughly one million candidate programs. Not a hundred. Not a thousand. A million.

Why so many? Because the probability that any single sample is correct is extremely low. Across all problems, less than 1% of samples pass even the public example tests. And among those that pass examples, many still fail on hidden tests. You need astronomical numbers to reliably find the needles in this haystack.

Diversity through metadata randomization

A million copies of the same wrong program are useless. AlphaCode ensures diversity through three mechanisms:

Language mixing: Half the samples are generated in Python, half in C++. Different languages lead to different algorithmic approaches.
Tag randomization: The model is conditioned on problem "tags" (like "greedy", "dp", "binary search") that are randomly sampled from the 50 most popular tags. Different tags steer the model toward different algorithms.
Rating randomization: Difficulty ratings are sampled uniformly between 800 and 3500, producing solutions of varying complexity.

Why random tags beat true tags: You might expect that providing the correct tags would help. It does — slightly. But random tags per sample are even better, because the diversity they create across a million samples outweighs the accuracy advantage of correct tags. This was validated experimentally: random-per-sample (13.5% solve rate) > true tags (13.3%) > random-per-problem (11.9%).

The funnel

Here is what happens to those million samples. Click the button to simulate the filtering pipeline for a single problem.

Scaling behavior

The solve rate scales approximately log-linearly with the number of samples. This means every 10x increase in samples yields a roughly constant improvement:

Sample Budget	10@k (41B)
1,000	16.9%
10,000	23.9%
100,000	28.2%
1,000,000	31.8%

With clustering added, the 1M-sample result jumps to 34.2%. That additional 2.4 percentage points from clustering is surprisingly large — it is what makes the difference between a mediocre result and a competitive one.

The compute trade-off

There is an interesting interplay between training compute and sampling compute. A larger, better-trained model produces higher-quality samples, so you need fewer of them. But each sample from a larger model costs more. The paper shows that both dimensions scale log-linearly:

More training compute (bigger models) → steeper sampling curves → exponentially fewer samples needed for the same solve rate
More sampling compute (more samples) → higher solve rates regardless of model size

This suggests an optimal allocation point: given a fixed compute budget, you should split it between training (model quality) and sampling (search breadth). The paper does not claim to have found this optimum, but the log-linear scaling curves provide the tools to estimate it.

Why does AlphaCode randomize tags and ratings when generating samples, rather than trying to predict the correct ones?

Random metadata per sample maximizes the diversity of the million-sample pool, which is more valuable than the small accuracy gain from correct tags — different tags steer the model toward different algorithms, increasing coverage of the solution space The correct tags are not available at test time Random metadata trains the model better

Chapter 5: The Filtering Pipeline

You have a million samples. You need ten. How do you pick them?

The first filter is the simplest and most powerful: run each sample against the public example tests from the problem statement. These are the small input/output pairs that Codeforces includes in every problem.

Stage 1: Example test filtering

This single step removes approximately 99% of all samples. Most generated programs are wrong in obvious ways — they crash, produce the wrong output format, or get the logic fundamentally wrong. The example tests catch all of this.

Model	% Problems with ≥1 Passing Sample	Avg % Samples Passing
300M	82.1%	0.39%
1B	87.2%	0.59%
9B	89.7%	0.76%
41B	92.3%	0.73%

Two things to notice. First, even after filtering 99% of a million samples, you still have thousands of candidates per problem. Second, for roughly 8-18% of problems (depending on model size), not a single sample passes the example tests from a million attempts. These are the hardest problems — the ones the system simply cannot solve yet.

The problem with random selection

After filtering, you could randomly pick 10 samples from the remaining thousands. But this is wasteful. Many of the surviving samples are syntactically different but semantically identical — they produce the same outputs on all inputs. Randomly selecting from this pool will often give you ten copies of the same (possibly wrong) behavior.

What you want is diversity: ten submissions that represent ten different algorithmic approaches. If one approach is wrong, another might be right.

The experimental proof: Without any filtering, solve rate is flat — more samples do not help at all when you can only submit 10. Filtering alone enables scaling, and clustering on top of filtering adds another significant boost. But there is still a large gap between clustering and the theoretical upper bound (submitting all samples), showing room for better selection methods.

The 10-submission constraint

Why exactly 10? This mirrors real competitive programming. On Codeforces, each wrong submission incurs a time penalty. Allowing unlimited submissions would let the system brute-force the hidden judge. The 10-submission limit forces the selection method to be smart, not just prolific.

The metric 10@k means: generate k total samples, filter and select 10 for submission, and check if any of those 10 passes all hidden tests. The complementary metric pass@k assumes all k samples can be submitted (an upper bound). The gap between 10@k and pass@k reveals how much room remains for better selection methods.

What filtering does NOT do

Filtering on example tests is a necessary but insufficient step. The example tests are small and few (1-2 per problem on average). They catch blatantly wrong solutions but miss subtle bugs that only appear on larger inputs or edge cases. That is why you need the next step: clustering.

Why does randomly selecting 10 samples (after filtering) waste the submission budget?

Many filtered samples are syntactically different but semantically identical — they produce the same outputs on all inputs — so random selection wastes submissions on redundant behavioral copies instead of covering diverse algorithmic approaches Random selection is slower than clustering The example tests are not reliable enough

Chapter 6: Clustering by Behavior

You have thousands of programs that pass the example tests. You need to pick 10 that are as different as possible. How do you define "different"?

Not by looking at the code. Two programs can look completely different (one uses recursion, the other uses iteration) but produce identical outputs on every input. Conversely, two programs that look nearly identical might differ on one edge case.

AlphaCode's answer: cluster programs by their behavior — specifically, by the outputs they produce on a set of test inputs.

Generating test inputs

The system needs additional test inputs beyond the public examples. It trains a separate test-input generation model using the same architecture as the main model, initialized from the same GitHub pre-trained checkpoint. This model is trained to predict test inputs from problem descriptions, using example, hidden, and generated test inputs as training data.

At test time, this model generates new inputs for unseen problems. These inputs are not guaranteed to be valid (especially for problems with complex input constraints), but even imperfect or invalid inputs are useful for distinguishing program behavior.

The clustering algorithm

1. Generate test inputs

Use the trained input generation model to create new test inputs for the problem

↓

2. Run all filtered programs

Execute every sample that passed the example tests on the generated inputs

↓

3. Cluster by output

Group programs that produce identical outputs on all generated inputs into the same cluster

↓

4. Select from largest to smallest

Pick one program from the largest cluster, then the next largest, cycling until 10 are selected

Why largest cluster first? There are many ways a solution can be wrong, but correct solutions tend to behave the same way — they all produce the correct output. So the largest cluster is most likely to contain the correct behavior. After submitting one from the largest cluster, you want diversity: try the second-largest (a different behavioral approach), then the third, and so on.

Impact of clustering

Clustering is not a minor optimization. It is the difference between a system that works and one that does not at high sample counts:

Method (1B model)	10@10k	10@100k	10@1M
No filtering	~5%	~5%	~5%
Filtering only	~14%	~18%	~21%
Filtering + clustering	~18%	~24%	~28%
Upper bound (pass@k)	~20%	~29%	~37%

Without filtering, the solve rate is flat — more samples do not help at all. With filtering alone, it scales but slowly. Clustering closes a significant portion of the gap to the theoretical upper bound, especially at high sample counts where behavioral diversity matters most.

Why do correct solutions tend to end up in the largest behavioral cluster?

There are many different ways to be wrong but typically only one correct output per input, so correct programs converge to the same behavior and cluster together into the largest group The largest cluster has the most diverse code Larger clusters run faster

Chapter 7: Results

AlphaCode was evaluated in two settings: simulated live Codeforces contests and the CodeContests benchmark. Both tell a consistent story.

Codeforces contests

The system was evaluated on 10 recent Codeforces contests (December 2021), each with more than 5,000 participants. For each contest, AlphaCode generated samples for every problem, filtered and clustered them, and submitted the top candidates to the Codeforces judge.

Metric	Value
Average ranking	Top 54.3%
Estimated Codeforces rating	1238
Percentile among active users	Top 28%
Average submissions per solved problem	2.4
Best single-contest ranking	Top 20.6%
Worst single-contest ranking	Top 95.7%

There is high variance across contests. AlphaCode tends to solve the easier problems in each contest but occasionally solves problems rated as high as 1800 (well above its average skill level).

A milestone: This was the first time a computer system achieved competitive performance against human programmers on this benchmark. The estimated rating of 1238 places AlphaCode above the majority of humans who actively participate in competitive programming — a self-selected group of skilled programmers.

CodeContests benchmark

System	10@1k	10@10k	10@100k	10@1M
AlphaCode 9B	16.9%	22.6%	27.1%	30.1%
AlphaCode 41B	16.9%	23.9%	28.2%	31.8%
41B + clustering	21.0%	26.2%	31.8%	34.2%

The 41B model consistently outperforms the 9B, and clustering consistently provides an improvement across all sample budgets. With 1M samples and clustering, AlphaCode solves over a third of all problems in the validation set.

Ablation: what matters

A careful build-up ablation shows each enhancement contributes:

Enhancement (cumulative)	10@100k
No enhancements (base enc-dec)	15.2%
+ Masked language modeling (encoder)	17.0%
+ Tempering (T=0.2)	18.7%
+ Tag & rating conditioning	19.3%
+ Value conditioning & prediction	20.2%
+ GOLD training objective	21.5%
+ Clustering	24.1%

Starting from 15.2% and reaching 24.1%, each addition contributes a meaningful gain. Clustering alone adds nearly 3 percentage points — the single largest improvement besides scaling model size.

What was AlphaCode's estimated Codeforces rating, and what does it mean?

1238, which places it in the top 28% of all active competitive programmers on the platform — a self-selected group of skilled coders — making it the first AI system to reach this level 1238, which means it can solve 12.38% of problems 2500, placing it among the very best programmers

Chapter 8: Analysis

The headline result is impressive, but understanding how AlphaCode solves problems — and where it fails — reveals deeper insights.

What types of problems does it solve?

Problem Tag	Solve Rate (41B, 10@10k)	Relative Strength
Sortings	33.8%	Strong
Greedy	25.0%	Strong
Math	28.2%	Strong
Bitmasks	20.4%	Good
Brute Force	15.7%	Moderate
Constructive	14.9%	Moderate
Graphs	16.5%	Moderate
DP	8.8%	Weak
Data Structures	13.6%	Moderate

AlphaCode is strongest on problems that require greedy algorithms, sorting, and direct math — areas where a straightforward approach often works. It is weakest on dynamic programming, which requires planning a multi-step recurrence and is one of the hardest categories even for humans.

Does it copy from training data?

A natural concern: maybe AlphaCode just memorizes solutions from training. The authors conducted a thorough analysis:

The longest common substrings between model solutions and the entire training dataset are comparable in length to those of human solutions
Shared substrings are mostly boilerplate (input parsing, FastIO classes), not core algorithmic logic
A detailed decomposition of 50 solutions into training-data substrings found no evidence of copied core logic
Model solutions contain approximately the same amount of dead code as human solutions

Not copying — composing. AlphaCode appears to learn patterns and idioms from its training data and compose them into novel solutions, much like a human programmer draws on remembered techniques. It does not retrieve and paste full solutions.

Syntactic correctness

A natural question: how often does the model even produce code that compiles? Python samples are mostly syntactically correct (the language is forgiving). C++ is harder, and smaller models produce more syntax errors. But this matters less than you might think — samples with syntax errors are trivially filtered out in the first stage, and the million-sample budget easily absorbs the loss.

Sensitivity to problem descriptions

Experiments modifying problem descriptions reveal that AlphaCode genuinely reads and depends on the full problem statement:

Simplified descriptions (removing narrative) increase solve rate from 3.0% to 15.7% — the narrative genuinely makes things harder
Opposite problems (reversed requirements) drop solve rate from 17.1% to 0.1% — it is not just pattern-matching on keywords
Removing the specification (input/output format) severely degrades performance — the model depends on knowing the exact format
Synonym substitution has minimal effect (12.5% vs. 13.5%) — the model understands meaning, not specific words

Sensitivity details

Description Change	Solve Rate	Interpretation
Original	17.1%	Baseline
Opposite problem	0.1%	Model is not keyword-matching
Related but different	3.2%	Understands problem structure
Verbose rewording	19.4%	Robust to paraphrasing
Simplified (remove narrative)	15.7% (from 3.0%)	Narrative is a real obstacle
7 synonyms substituted	12.5%	Minor impact
Remove spec section	4.8%	I/O format is critical

The "opposite problem" result (0.1%) is particularly telling. If the model were just matching keywords like "prime" to stored solutions about primes, it would still produce solutions. Instead, it correctly follows the (reversed) requirements and produces outputs that are wrong for the original problem — evidence of genuine comprehension.

Loss is a poor proxy for solve rate

One of the paper's more subtle findings: the validation language modeling loss starts increasing after ~50k fine-tuning steps (indicating overfitting), but the solve rate continues to improve. The hypothesis: the model reallocates probability mass from atypical solutions toward more typical ones. This hurts average loss but increases the chance that a sampled solution is correct.

Implication: If you stopped training when validation loss started rising (the standard practice), you would end up with a significantly worse model. The one-of-many nature of code generation means loss and task performance can diverge. This is a cautionary tale for anyone training code generation models.

Why does validation loss increase while solve rate continues to improve during fine-tuning?

The model concentrates probability mass on typical correct solutions at the expense of atypical ones, worsening average loss across all solutions but increasing the chance that any given sample is correct The validation set is too small to be reliable The model is memorizing the training data

Chapter 9: Connections

AlphaCode introduced the "generate massively, then filter" paradigm for code generation. Since its publication, this idea has echoed across multiple follow-up systems.

Large Language Monkeys (2024)

Brown et al. generalized AlphaCode's core insight: given enough samples, even weaker models can solve hard problems. They showed that coverage (the probability that at least one sample is correct) scales as a sigmoid function of the log sample count, and that different problems have different "critical sample counts" where coverage transitions from near-zero to near-one. This work formalized what AlphaCode demonstrated empirically.

AlphaCode 2 (2023)

DeepMind's follow-up, powered by Gemini, reached top 15% on Codeforces (estimated rating ~1650). Key improvements: better base model (Gemini vs. custom transformer), fine-tuning with a policy-value approach, and a more sophisticated filtering pipeline.

AlphaEvolve (2025)

While AlphaCode generates solutions from scratch and filters, AlphaEvolve takes a different approach: it evolves solutions using LLMs as mutation operators in an evolutionary loop. Instead of generating a million independent samples, AlphaEvolve iteratively improves a population of solutions. Where AlphaCode relies on brute-force sampling, AlphaEvolve relies on directed search.

CodeMonkeys (Ehrlich et al., 2025)

Took AlphaCode's scaling insight further: by scaling inference-time compute (more samples, longer generations), they pushed frontier models to solve 57.4% of SWE-bench Verified problems. CodeMonkeys showed that AlphaCode's approach extends beyond competitive programming to real-world software engineering tasks.

RLEF (Gehring et al., 2024)

Reinforcement Learning from Execution Feedback uses program execution results as the reward signal for RL fine-tuning. Where AlphaCode uses execution to filter at inference time, RLEF uses execution to improve the model at training time. The two approaches are complementary: a model trained with RLEF would need fewer samples to reach the same solve rate.

Limitations of AlphaCode

For all its success, AlphaCode has clear weaknesses:

Compute cost: Generating and executing a million samples per problem requires massive compute. A single evaluation round uses thousands of TPU-hours.
Dynamic programming: AlphaCode is weakest on DP problems, which require planning multi-step recurrences — hard to stumble upon even with a million attempts.
Hard problems: It mostly solves easier contest problems (ratings 800-1500). Problems rated above 2000 are almost never solved.
No reasoning chain: The model generates code directly from the problem, with no explicit planning or reasoning step. Later systems (like AlphaCode 2) add deliberate reasoning.

The broader lesson

AlphaCode's lasting contribution is not any single architectural choice. It is the demonstration that verification is the bottleneck, not generation. In domains where you can automatically check solutions — by running code, checking proofs, evaluating fitness functions — the right strategy is to generate at scale and let the verifier do the work. This insight now drives approaches across code generation, mathematical reasoning, and scientific discovery.

The paradigm: AlphaCode showed the world that at sufficient scale, generation + verification beats any amount of careful single-shot reasoning. Every system that generates many candidates and filters by execution — from AlphaCode 2 to Large Language Monkeys to RLEF — is working in the paradigm that AlphaCode established.

arxiv.org/abs/2203.07814 | ← Back to Veanors

AlphaCode: Competition-Level Code Generation