Li, Choi, Chung et al. — Google DeepMind, 2022

AlphaCode: Competition-Level Code Generation

The first AI system to reach competitive programmer level — top 54.3% on Codeforces. The secret is not a smarter model. It is generating a million solutions, then ruthlessly filtering down to ten using program behavior.

Prerequisites: Transformer basics + Encoder-decoder models + Sampling from language models
10
Chapters
3+
Simulations

Chapter 0: The Problem

Imagine sitting down for a Codeforces contest. You are given a multi-paragraph problem statement in English, often wrapped in a narrative ("Polycarp has n boxes..."). You must read it carefully, understand the hidden algorithmic structure, choose the right data structure, implement a correct solution in C++ or Python, and submit it within a few hours. A single wrong edge case, and the judge rejects your code.

This is competitive programming — and it is one of the hardest benchmarks for code generation. Unlike translating a docstring into a short function (HumanEval), or filling in a missing API call, competitive programming requires:

In early 2022, large language models like Codex could complete simple programming tasks — writing a function to reverse a string, or calling the right library. But when evaluated on actual competition problems, even the best models solved fewer than 5% of problems.

The core challenge: Generating code that solves a specified task requires searching a vast, structured space of possible programs, with an extremely sparse reward signal. A single character edit can completely change program behavior. Solutions can look dramatically different even for the same problem. And the test cases that determine correctness are hidden.

AlphaCode was DeepMind's answer to this challenge. In simulated evaluations on 10 recent Codeforces contests (each with 5,000+ participants), AlphaCode achieved an average ranking in the top 54.3% — an estimated Codeforces rating of 1238. This was the first time an AI system reached competitive programmer level on this widely-recognized benchmark.

What makes competitive programming fundamentally harder than simple code completion tasks like HumanEval?

Chapter 1: The Key Insight

If you ask a language model to write one solution to a hard competitive programming problem, it will almost certainly be wrong. The pass rate for any individual sample is tiny — often below 1%.

But here is the insight that makes AlphaCode work: if you generate enough samples, some of them will be correct. And if you have a way to identify the correct ones, you win.

Think of it like a lottery. Each individual ticket has a very low chance of winning. But if you buy a million tickets, your odds become quite good. The trick is that in programming, unlike in a lottery, you have a powerful verification tool: you can run the code.

Generate massively, then filter. AlphaCode generates roughly 1,000,000 candidate solutions per problem. It filters them down to about 10 using automated testing. This is a fundamentally different strategy from trying to produce one perfect answer. It decouples generation quality from selection quality, and bets on scale for the former and program execution for the latter.

This approach has three critical requirements:

1. A model that puts some probability mass on correct solutions
Even if p(correct) is 0.001%, with 106 samples you expect ~10 correct ones
2. Efficient large-scale sampling
Generating 106 programs must be fast enough to be practical
3. A filtering pipeline that picks winners
From 106 down to 10 submissions, without access to hidden tests

The paper shows that solve rate scales log-linearly with the number of samples. Double the sample budget, and you get a roughly constant improvement in solve rate. This scaling holds from 10 samples all the way to 1,000,000.

ApproachStrategySolve Rate on CodeContests
Codex 12B1 sample, no filtering~1%
AlphaCode 1B1k samples, filter to 10~12%
AlphaCode 41B1M samples, filter + cluster to 10~34%

Better models also have steeper scaling slopes — each doubling of samples yields a bigger improvement for larger models. This means model quality and sample quantity are complementary, not substitutes.

Why does AlphaCode's solve rate scale log-linearly with the number of samples?

Chapter 2: The CodeContests Dataset

A good benchmark for code generation needs to answer one question reliably: did this program actually solve the problem? Surprisingly, existing datasets were terrible at this.

The false positive problem

When you check a program by running it against test cases, insufficient test coverage means some wrong programs pass all tests. These are false positives. The paper found alarming rates in existing benchmarks:

DatasetTests / ProblemFalse Positive Rate
APPS2160%
HumanEval7.830%
CodeContests (raw)12.462%
CodeContests (final)203.74%

If 60% of "correct" solutions are actually wrong, the benchmark is useless. You cannot measure progress if your ruler is broken.

How CodeContests fixes this

The authors built CodeContests by combining scraped Codeforces data with Description2Code and CodeNet. Three design choices reduced the false positive rate from 62% to 4%:

1. Generated test cases
Mutate existing inputs (bit flips, increment/decrement integers, swap characters). Verify mutations by running 30 known-correct solutions and keeping only inputs where all agree on the output. Up to 200 generated tests per problem.
2. Filter out weak problems
Keep only problems with ≥5 hidden/generated tests that produce ≥2 different outputs. This prevents a model from trivially solving problems by always outputting "YES".
3. Strict temporal split
All training data appeared online before 2021/07/14. Validation problems: 2021/07/15–09/20. Test problems: after 2021/09/21. No data leakage possible.
Why the temporal split matters: Competitive programmers routinely study past contest solutions. A temporal split mirrors this reality — the model can learn from past problems, but the evaluation problems are genuinely new. This is stricter than random splits, which risk information leakage through similar problems.

Dataset statistics

SplitProblemsExample TestsHidden TestsGenerated Tests
Train13,3282.0 avg14.8 avg79.1 avg
Valid1171.5 avg12.9 avg190.0 avg
Test1651.7 avg9.4 avg192.7 avg

Each problem also includes correct and incorrect human submissions in C++, Python, and Java — hundreds of solutions per problem. These are essential for training, because the model needs to see both what works and what does not.

What a problem looks like

Here is a real Codeforces problem that AlphaCode solved (rated 1500, medium difficulty):

# Problem: "Backspace"
# You type string s character by character.
# Instead of typing a char, you can press Backspace,
# which deletes the last typed character.
# Can you obtain string t from s?
# Example: s="abcbd", press backspace on 1st and 4th chars → "bd"

# AlphaCode's generated solution (Python):
t = int(input())
for i in range(t):
  s = list(reversed(input()))
  b = list(reversed(input()))
  c = []
  while b and s:
    if s[0] == b[0]:
      c.append(b.pop(0))
      s.pop(0)
    elif len(s) != 1:
      s.pop(0); s.pop(0)  # backspace deletes 2
    else: s.pop(0)
  print("YES" if not b else "NO")

Notice what the model did: it reversed both strings and matched from the end, correctly handling the backspace-deletes-two-characters logic. This is not copying — the analysis found no matching logic in the training set. It is genuine algorithmic reasoning.

How did the authors reduce the false positive rate from 62% to 4%?

Chapter 3: The Architecture

Competitive programming is a sequence-to-sequence task: given a problem description X (natural language), produce a solution Y (code). This naturally motivates an encoder-decoder transformer.

Why encoder-decoder, not decoder-only?

Two reasons. First, the encoder can use bidirectional attention over the problem description — tokens at the beginning can attend to tokens at the end, which matters for understanding long, complex problem statements. Decoder-only models only see left context.

Second, the architecture allows an asymmetric design. Problem descriptions are typically twice as long as solutions (average ~1,500 vs. ~750 tokens). AlphaCode uses 1,536 tokens for the encoder and only 768 for the decoder. This asymmetry is critical for sampling efficiency.

Shallow encoder, deep decoder

An even more surprising design choice: the encoder is shallow (e.g. 5 blocks for the 1B model) and the decoder is deep (30 blocks). The encoder only needs to build a good representation of the problem. The decoder does the hard work of actually generating code token by token. This split improves training efficiency without hurting solve rate.

ModelParamsdmodelEnc BlocksDec BlocksSamples/TPU-sec
300M284M768424
1B1.1B14085304.74
3B2.8B2048636
9B8.7B3072848
41B41.1B6144856

Multi-query attention for fast sampling

Since AlphaCode needs to generate millions of samples, sampling speed is paramount. The model uses multi-query attention (Shazeer, 2019): full query heads but shared key and value heads per block. This drastically reduces memory usage and KV-cache update costs during autoregressive decoding, enabling larger batch sizes and much faster throughput.

The impact is dramatic. At 1B parameters, the encoder-decoder with multi-query attention generates 4.74 samples per TPU-second, compared to 1.23 for a decoder-only model and just 0.37 for standard multi-head attention. That is a 12.8x speedup — the difference between generating a million samples in hours versus days.

Training recipe

Training happens in two phases:

Pre-training on GitHub
715 GB of code across 12 languages. Encoder: masked language modeling. Decoder: next-token prediction. Split files at random pivot points.
Fine-tuning on CodeContests
Problem descriptions → encoder, solutions → decoder. Uses tempering (T=0.2), value conditioning (correct/incorrect labels), and GOLD (offline RL for one-of-many tasks).

Value conditioning

CodeContests contains both correct and incorrect solutions. Rather than discarding the incorrect ones, AlphaCode uses value conditioning: a tag inserted into the problem description tells the model whether the solution being generated is correct or incorrect. At training time, the model sees both. At sampling time, it is always conditioned on "CORRECT SOLUTION".

# What the encoder sees (prepended to the problem):
RATING : 1200
TAGS : dp, implementation
LANGUAGE IS python3
CORRECT SOLUTION
# ... followed by the full problem description

This is a remarkably simple trick. By telling the model "this is a correct solution," you bias the distribution toward programs that are actually right. The model also has an auxiliary value prediction head that classifies whether a solution is correct, providing an additional training signal.

Tempering is counterintuitive: Training with temperature T=0.2 makes the training distribution sharper, which paradoxically makes the inference distribution smoother. This prevents overfitting to the small fine-tuning dataset and preserves sample diversity at generation time. It is the opposite of the original suggestion in the tempering paper.

GOLD: training for one-of-many

Standard maximum likelihood training tries to put probability on every correct solution in the training set (like recall). But AlphaCode's metric only requires finding one correct solution (like precision). These objectives conflict.

GOLD (Pang and He, 2020) is an offline RL technique that resolves this. It lets the model concentrate probability mass on solutions it already assigns high likelihood to, and ignore solutions that are not in its distribution. Think of it as saying: "Don't try to learn every possible way to solve this problem. Master the approaches that come naturally to you."

GOLD improves the 10@100k solve rate from 20.2% to 21.5% on the 1B model — and the benefit compounds with larger sample budgets because the sharpened distribution produces more concentrated, higher-quality samples.

Why does AlphaCode use a shallow encoder and deep decoder?

Chapter 4: Massive Sampling

This is the heart of AlphaCode. For each problem, the system generates roughly one million candidate programs. Not a hundred. Not a thousand. A million.

Why so many? Because the probability that any single sample is correct is extremely low. Across all problems, less than 1% of samples pass even the public example tests. And among those that pass examples, many still fail on hidden tests. You need astronomical numbers to reliably find the needles in this haystack.

Diversity through metadata randomization

A million copies of the same wrong program are useless. AlphaCode ensures diversity through three mechanisms:

Why random tags beat true tags: You might expect that providing the correct tags would help. It does — slightly. But random tags per sample are even better, because the diversity they create across a million samples outweighs the accuracy advantage of correct tags. This was validated experimentally: random-per-sample (13.5% solve rate) > true tags (13.3%) > random-per-problem (11.9%).

The funnel

Here is what happens to those million samples. Click the button to simulate the filtering pipeline for a single problem.

Scaling behavior

The solve rate scales approximately log-linearly with the number of samples. This means every 10x increase in samples yields a roughly constant improvement:

Sample Budget10@k (41B)
1,00016.9%
10,00023.9%
100,00028.2%
1,000,00031.8%

With clustering added, the 1M-sample result jumps to 34.2%. That additional 2.4 percentage points from clustering is surprisingly large — it is what makes the difference between a mediocre result and a competitive one.

The compute trade-off

There is an interesting interplay between training compute and sampling compute. A larger, better-trained model produces higher-quality samples, so you need fewer of them. But each sample from a larger model costs more. The paper shows that both dimensions scale log-linearly:

This suggests an optimal allocation point: given a fixed compute budget, you should split it between training (model quality) and sampling (search breadth). The paper does not claim to have found this optimum, but the log-linear scaling curves provide the tools to estimate it.

Why does AlphaCode randomize tags and ratings when generating samples, rather than trying to predict the correct ones?

Chapter 5: The Filtering Pipeline

You have a million samples. You need ten. How do you pick them?

The first filter is the simplest and most powerful: run each sample against the public example tests from the problem statement. These are the small input/output pairs that Codeforces includes in every problem.

Stage 1: Example test filtering

This single step removes approximately 99% of all samples. Most generated programs are wrong in obvious ways — they crash, produce the wrong output format, or get the logic fundamentally wrong. The example tests catch all of this.

Model% Problems with ≥1 Passing SampleAvg % Samples Passing
300M82.1%0.39%
1B87.2%0.59%
9B89.7%0.76%
41B92.3%0.73%

Two things to notice. First, even after filtering 99% of a million samples, you still have thousands of candidates per problem. Second, for roughly 8-18% of problems (depending on model size), not a single sample passes the example tests from a million attempts. These are the hardest problems — the ones the system simply cannot solve yet.

The problem with random selection

After filtering, you could randomly pick 10 samples from the remaining thousands. But this is wasteful. Many of the surviving samples are syntactically different but semantically identical — they produce the same outputs on all inputs. Randomly selecting from this pool will often give you ten copies of the same (possibly wrong) behavior.

What you want is diversity: ten submissions that represent ten different algorithmic approaches. If one approach is wrong, another might be right.

The experimental proof: Without any filtering, solve rate is flat — more samples do not help at all when you can only submit 10. Filtering alone enables scaling, and clustering on top of filtering adds another significant boost. But there is still a large gap between clustering and the theoretical upper bound (submitting all samples), showing room for better selection methods.

The 10-submission constraint

Why exactly 10? This mirrors real competitive programming. On Codeforces, each wrong submission incurs a time penalty. Allowing unlimited submissions would let the system brute-force the hidden judge. The 10-submission limit forces the selection method to be smart, not just prolific.

The metric 10@k means: generate k total samples, filter and select 10 for submission, and check if any of those 10 passes all hidden tests. The complementary metric pass@k assumes all k samples can be submitted (an upper bound). The gap between 10@k and pass@k reveals how much room remains for better selection methods.

What filtering does NOT do

Filtering on example tests is a necessary but insufficient step. The example tests are small and few (1-2 per problem on average). They catch blatantly wrong solutions but miss subtle bugs that only appear on larger inputs or edge cases. That is why you need the next step: clustering.

Why does randomly selecting 10 samples (after filtering) waste the submission budget?

Chapter 6: Clustering by Behavior

You have thousands of programs that pass the example tests. You need to pick 10 that are as different as possible. How do you define "different"?

Not by looking at the code. Two programs can look completely different (one uses recursion, the other uses iteration) but produce identical outputs on every input. Conversely, two programs that look nearly identical might differ on one edge case.

AlphaCode's answer: cluster programs by their behavior — specifically, by the outputs they produce on a set of test inputs.

Generating test inputs

The system needs additional test inputs beyond the public examples. It trains a separate test-input generation model using the same architecture as the main model, initialized from the same GitHub pre-trained checkpoint. This model is trained to predict test inputs from problem descriptions, using example, hidden, and generated test inputs as training data.

At test time, this model generates new inputs for unseen problems. These inputs are not guaranteed to be valid (especially for problems with complex input constraints), but even imperfect or invalid inputs are useful for distinguishing program behavior.

The clustering algorithm

1. Generate test inputs
Use the trained input generation model to create new test inputs for the problem
2. Run all filtered programs
Execute every sample that passed the example tests on the generated inputs
3. Cluster by output
Group programs that produce identical outputs on all generated inputs into the same cluster
4. Select from largest to smallest
Pick one program from the largest cluster, then the next largest, cycling until 10 are selected
Why largest cluster first? There are many ways a solution can be wrong, but correct solutions tend to behave the same way — they all produce the correct output. So the largest cluster is most likely to contain the correct behavior. After submitting one from the largest cluster, you want diversity: try the second-largest (a different behavioral approach), then the third, and so on.

Impact of clustering

Clustering is not a minor optimization. It is the difference between a system that works and one that does not at high sample counts:

Method (1B model)10@10k10@100k10@1M
No filtering~5%~5%~5%
Filtering only~14%~18%~21%
Filtering + clustering~18%~24%~28%
Upper bound (pass@k)~20%~29%~37%

Without filtering, the solve rate is flat — more samples do not help at all. With filtering alone, it scales but slowly. Clustering closes a significant portion of the gap to the theoretical upper bound, especially at high sample counts where behavioral diversity matters most.

Why do correct solutions tend to end up in the largest behavioral cluster?

Chapter 7: Results

AlphaCode was evaluated in two settings: simulated live Codeforces contests and the CodeContests benchmark. Both tell a consistent story.

Codeforces contests

The system was evaluated on 10 recent Codeforces contests (December 2021), each with more than 5,000 participants. For each contest, AlphaCode generated samples for every problem, filtered and clustered them, and submitted the top candidates to the Codeforces judge.

MetricValue
Average rankingTop 54.3%
Estimated Codeforces rating1238
Percentile among active usersTop 28%
Average submissions per solved problem2.4
Best single-contest rankingTop 20.6%
Worst single-contest rankingTop 95.7%

There is high variance across contests. AlphaCode tends to solve the easier problems in each contest but occasionally solves problems rated as high as 1800 (well above its average skill level).

A milestone: This was the first time a computer system achieved competitive performance against human programmers on this benchmark. The estimated rating of 1238 places AlphaCode above the majority of humans who actively participate in competitive programming — a self-selected group of skilled programmers.

CodeContests benchmark

System10@1k10@10k10@100k10@1M
AlphaCode 9B16.9%22.6%27.1%30.1%
AlphaCode 41B16.9%23.9%28.2%31.8%
41B + clustering21.0%26.2%31.8%34.2%

The 41B model consistently outperforms the 9B, and clustering consistently provides an improvement across all sample budgets. With 1M samples and clustering, AlphaCode solves over a third of all problems in the validation set.

Ablation: what matters

A careful build-up ablation shows each enhancement contributes:

Enhancement (cumulative)10@100k
No enhancements (base enc-dec)15.2%
+ Masked language modeling (encoder)17.0%
+ Tempering (T=0.2)18.7%
+ Tag & rating conditioning19.3%
+ Value conditioning & prediction20.2%
+ GOLD training objective21.5%
+ Clustering24.1%

Starting from 15.2% and reaching 24.1%, each addition contributes a meaningful gain. Clustering alone adds nearly 3 percentage points — the single largest improvement besides scaling model size.

What was AlphaCode's estimated Codeforces rating, and what does it mean?

Chapter 8: Analysis

The headline result is impressive, but understanding how AlphaCode solves problems — and where it fails — reveals deeper insights.

What types of problems does it solve?

Problem TagSolve Rate (41B, 10@10k)Relative Strength
Sortings33.8%Strong
Greedy25.0%Strong
Math28.2%Strong
Bitmasks20.4%Good
Brute Force15.7%Moderate
Constructive14.9%Moderate
Graphs16.5%Moderate
DP8.8%Weak
Data Structures13.6%Moderate

AlphaCode is strongest on problems that require greedy algorithms, sorting, and direct math — areas where a straightforward approach often works. It is weakest on dynamic programming, which requires planning a multi-step recurrence and is one of the hardest categories even for humans.

Does it copy from training data?

A natural concern: maybe AlphaCode just memorizes solutions from training. The authors conducted a thorough analysis:

Not copying — composing. AlphaCode appears to learn patterns and idioms from its training data and compose them into novel solutions, much like a human programmer draws on remembered techniques. It does not retrieve and paste full solutions.

Syntactic correctness

A natural question: how often does the model even produce code that compiles? Python samples are mostly syntactically correct (the language is forgiving). C++ is harder, and smaller models produce more syntax errors. But this matters less than you might think — samples with syntax errors are trivially filtered out in the first stage, and the million-sample budget easily absorbs the loss.

Sensitivity to problem descriptions

Experiments modifying problem descriptions reveal that AlphaCode genuinely reads and depends on the full problem statement:

Sensitivity details

Description ChangeSolve RateInterpretation
Original17.1%Baseline
Opposite problem0.1%Model is not keyword-matching
Related but different3.2%Understands problem structure
Verbose rewording19.4%Robust to paraphrasing
Simplified (remove narrative)15.7% (from 3.0%)Narrative is a real obstacle
7 synonyms substituted12.5%Minor impact
Remove spec section4.8%I/O format is critical

The "opposite problem" result (0.1%) is particularly telling. If the model were just matching keywords like "prime" to stored solutions about primes, it would still produce solutions. Instead, it correctly follows the (reversed) requirements and produces outputs that are wrong for the original problem — evidence of genuine comprehension.

Loss is a poor proxy for solve rate

One of the paper's more subtle findings: the validation language modeling loss starts increasing after ~50k fine-tuning steps (indicating overfitting), but the solve rate continues to improve. The hypothesis: the model reallocates probability mass from atypical solutions toward more typical ones. This hurts average loss but increases the chance that a sampled solution is correct.

Implication: If you stopped training when validation loss started rising (the standard practice), you would end up with a significantly worse model. The one-of-many nature of code generation means loss and task performance can diverge. This is a cautionary tale for anyone training code generation models.
Why does validation loss increase while solve rate continues to improve during fine-tuning?

Chapter 9: Connections

AlphaCode introduced the "generate massively, then filter" paradigm for code generation. Since its publication, this idea has echoed across multiple follow-up systems.

Large Language Monkeys (2024)

Brown et al. generalized AlphaCode's core insight: given enough samples, even weaker models can solve hard problems. They showed that coverage (the probability that at least one sample is correct) scales as a sigmoid function of the log sample count, and that different problems have different "critical sample counts" where coverage transitions from near-zero to near-one. This work formalized what AlphaCode demonstrated empirically.

AlphaCode 2 (2023)

DeepMind's follow-up, powered by Gemini, reached top 15% on Codeforces (estimated rating ~1650). Key improvements: better base model (Gemini vs. custom transformer), fine-tuning with a policy-value approach, and a more sophisticated filtering pipeline.

AlphaEvolve (2025)

While AlphaCode generates solutions from scratch and filters, AlphaEvolve takes a different approach: it evolves solutions using LLMs as mutation operators in an evolutionary loop. Instead of generating a million independent samples, AlphaEvolve iteratively improves a population of solutions. Where AlphaCode relies on brute-force sampling, AlphaEvolve relies on directed search.

CodeMonkeys (Ehrlich et al., 2025)

Took AlphaCode's scaling insight further: by scaling inference-time compute (more samples, longer generations), they pushed frontier models to solve 57.4% of SWE-bench Verified problems. CodeMonkeys showed that AlphaCode's approach extends beyond competitive programming to real-world software engineering tasks.

RLEF (Gehring et al., 2024)

Reinforcement Learning from Execution Feedback uses program execution results as the reward signal for RL fine-tuning. Where AlphaCode uses execution to filter at inference time, RLEF uses execution to improve the model at training time. The two approaches are complementary: a model trained with RLEF would need fewer samples to reach the same solve rate.

Limitations of AlphaCode

For all its success, AlphaCode has clear weaknesses:

The broader lesson

AlphaCode's lasting contribution is not any single architectural choice. It is the demonstration that verification is the bottleneck, not generation. In domains where you can automatically check solutions — by running code, checking proofs, evaluating fitness functions — the right strategy is to generate at scale and let the verifier do the work. This insight now drives approaches across code generation, mathematical reasoning, and scientific discovery.

The paradigm: AlphaCode showed the world that at sufficient scale, generation + verification beats any amount of careful single-shot reasoning. Every system that generates many candidates and filters by execution — from AlphaCode 2 to Large Language Monkeys to RLEF — is working in the paradigm that AlphaCode established.

arxiv.org/abs/2203.07814  |  ← Back to Veanors