The first AI system to reach competitive programmer level — top 54.3% on Codeforces. The secret is not a smarter model. It is generating a million solutions, then ruthlessly filtering down to ten using program behavior.
Imagine sitting down for a Codeforces contest. You are given a multi-paragraph problem statement in English, often wrapped in a narrative ("Polycarp has n boxes..."). You must read it carefully, understand the hidden algorithmic structure, choose the right data structure, implement a correct solution in C++ or Python, and submit it within a few hours. A single wrong edge case, and the judge rejects your code.
This is competitive programming — and it is one of the hardest benchmarks for code generation. Unlike translating a docstring into a short function (HumanEval), or filling in a missing API call, competitive programming requires:
In early 2022, large language models like Codex could complete simple programming tasks — writing a function to reverse a string, or calling the right library. But when evaluated on actual competition problems, even the best models solved fewer than 5% of problems.
AlphaCode was DeepMind's answer to this challenge. In simulated evaluations on 10 recent Codeforces contests (each with 5,000+ participants), AlphaCode achieved an average ranking in the top 54.3% — an estimated Codeforces rating of 1238. This was the first time an AI system reached competitive programmer level on this widely-recognized benchmark.
If you ask a language model to write one solution to a hard competitive programming problem, it will almost certainly be wrong. The pass rate for any individual sample is tiny — often below 1%.
But here is the insight that makes AlphaCode work: if you generate enough samples, some of them will be correct. And if you have a way to identify the correct ones, you win.
Think of it like a lottery. Each individual ticket has a very low chance of winning. But if you buy a million tickets, your odds become quite good. The trick is that in programming, unlike in a lottery, you have a powerful verification tool: you can run the code.
This approach has three critical requirements:
The paper shows that solve rate scales log-linearly with the number of samples. Double the sample budget, and you get a roughly constant improvement in solve rate. This scaling holds from 10 samples all the way to 1,000,000.
| Approach | Strategy | Solve Rate on CodeContests |
|---|---|---|
| Codex 12B | 1 sample, no filtering | ~1% |
| AlphaCode 1B | 1k samples, filter to 10 | ~12% |
| AlphaCode 41B | 1M samples, filter + cluster to 10 | ~34% |
Better models also have steeper scaling slopes — each doubling of samples yields a bigger improvement for larger models. This means model quality and sample quantity are complementary, not substitutes.
A good benchmark for code generation needs to answer one question reliably: did this program actually solve the problem? Surprisingly, existing datasets were terrible at this.
When you check a program by running it against test cases, insufficient test coverage means some wrong programs pass all tests. These are false positives. The paper found alarming rates in existing benchmarks:
| Dataset | Tests / Problem | False Positive Rate |
|---|---|---|
| APPS | 21 | 60% |
| HumanEval | 7.8 | 30% |
| CodeContests (raw) | 12.4 | 62% |
| CodeContests (final) | 203.7 | 4% |
If 60% of "correct" solutions are actually wrong, the benchmark is useless. You cannot measure progress if your ruler is broken.
The authors built CodeContests by combining scraped Codeforces data with Description2Code and CodeNet. Three design choices reduced the false positive rate from 62% to 4%:
| Split | Problems | Example Tests | Hidden Tests | Generated Tests |
|---|---|---|---|---|
| Train | 13,328 | 2.0 avg | 14.8 avg | 79.1 avg |
| Valid | 117 | 1.5 avg | 12.9 avg | 190.0 avg |
| Test | 165 | 1.7 avg | 9.4 avg | 192.7 avg |
Each problem also includes correct and incorrect human submissions in C++, Python, and Java — hundreds of solutions per problem. These are essential for training, because the model needs to see both what works and what does not.
Here is a real Codeforces problem that AlphaCode solved (rated 1500, medium difficulty):
# Problem: "Backspace" # You type string s character by character. # Instead of typing a char, you can press Backspace, # which deletes the last typed character. # Can you obtain string t from s? # Example: s="abcbd", press backspace on 1st and 4th chars → "bd" # AlphaCode's generated solution (Python): t = int(input()) for i in range(t): s = list(reversed(input())) b = list(reversed(input())) c = [] while b and s: if s[0] == b[0]: c.append(b.pop(0)) s.pop(0) elif len(s) != 1: s.pop(0); s.pop(0) # backspace deletes 2 else: s.pop(0) print("YES" if not b else "NO")
Notice what the model did: it reversed both strings and matched from the end, correctly handling the backspace-deletes-two-characters logic. This is not copying — the analysis found no matching logic in the training set. It is genuine algorithmic reasoning.
Competitive programming is a sequence-to-sequence task: given a problem description X (natural language), produce a solution Y (code). This naturally motivates an encoder-decoder transformer.
Two reasons. First, the encoder can use bidirectional attention over the problem description — tokens at the beginning can attend to tokens at the end, which matters for understanding long, complex problem statements. Decoder-only models only see left context.
Second, the architecture allows an asymmetric design. Problem descriptions are typically twice as long as solutions (average ~1,500 vs. ~750 tokens). AlphaCode uses 1,536 tokens for the encoder and only 768 for the decoder. This asymmetry is critical for sampling efficiency.
An even more surprising design choice: the encoder is shallow (e.g. 5 blocks for the 1B model) and the decoder is deep (30 blocks). The encoder only needs to build a good representation of the problem. The decoder does the hard work of actually generating code token by token. This split improves training efficiency without hurting solve rate.
| Model | Params | dmodel | Enc Blocks | Dec Blocks | Samples/TPU-sec |
|---|---|---|---|---|---|
| 300M | 284M | 768 | 4 | 24 | — |
| 1B | 1.1B | 1408 | 5 | 30 | 4.74 |
| 3B | 2.8B | 2048 | 6 | 36 | — |
| 9B | 8.7B | 3072 | 8 | 48 | — |
| 41B | 41.1B | 6144 | 8 | 56 | — |
Since AlphaCode needs to generate millions of samples, sampling speed is paramount. The model uses multi-query attention (Shazeer, 2019): full query heads but shared key and value heads per block. This drastically reduces memory usage and KV-cache update costs during autoregressive decoding, enabling larger batch sizes and much faster throughput.
The impact is dramatic. At 1B parameters, the encoder-decoder with multi-query attention generates 4.74 samples per TPU-second, compared to 1.23 for a decoder-only model and just 0.37 for standard multi-head attention. That is a 12.8x speedup — the difference between generating a million samples in hours versus days.
Training happens in two phases:
CodeContests contains both correct and incorrect solutions. Rather than discarding the incorrect ones, AlphaCode uses value conditioning: a tag inserted into the problem description tells the model whether the solution being generated is correct or incorrect. At training time, the model sees both. At sampling time, it is always conditioned on "CORRECT SOLUTION".
# What the encoder sees (prepended to the problem): RATING : 1200 TAGS : dp, implementation LANGUAGE IS python3 CORRECT SOLUTION # ... followed by the full problem description
This is a remarkably simple trick. By telling the model "this is a correct solution," you bias the distribution toward programs that are actually right. The model also has an auxiliary value prediction head that classifies whether a solution is correct, providing an additional training signal.
Standard maximum likelihood training tries to put probability on every correct solution in the training set (like recall). But AlphaCode's metric only requires finding one correct solution (like precision). These objectives conflict.
GOLD (Pang and He, 2020) is an offline RL technique that resolves this. It lets the model concentrate probability mass on solutions it already assigns high likelihood to, and ignore solutions that are not in its distribution. Think of it as saying: "Don't try to learn every possible way to solve this problem. Master the approaches that come naturally to you."
GOLD improves the 10@100k solve rate from 20.2% to 21.5% on the 1B model — and the benefit compounds with larger sample budgets because the sharpened distribution produces more concentrated, higher-quality samples.
This is the heart of AlphaCode. For each problem, the system generates roughly one million candidate programs. Not a hundred. Not a thousand. A million.
Why so many? Because the probability that any single sample is correct is extremely low. Across all problems, less than 1% of samples pass even the public example tests. And among those that pass examples, many still fail on hidden tests. You need astronomical numbers to reliably find the needles in this haystack.
A million copies of the same wrong program are useless. AlphaCode ensures diversity through three mechanisms:
Here is what happens to those million samples. Click the button to simulate the filtering pipeline for a single problem.
The solve rate scales approximately log-linearly with the number of samples. This means every 10x increase in samples yields a roughly constant improvement:
| Sample Budget | 10@k (41B) |
|---|---|
| 1,000 | 16.9% |
| 10,000 | 23.9% |
| 100,000 | 28.2% |
| 1,000,000 | 31.8% |
With clustering added, the 1M-sample result jumps to 34.2%. That additional 2.4 percentage points from clustering is surprisingly large — it is what makes the difference between a mediocre result and a competitive one.
There is an interesting interplay between training compute and sampling compute. A larger, better-trained model produces higher-quality samples, so you need fewer of them. But each sample from a larger model costs more. The paper shows that both dimensions scale log-linearly:
This suggests an optimal allocation point: given a fixed compute budget, you should split it between training (model quality) and sampling (search breadth). The paper does not claim to have found this optimum, but the log-linear scaling curves provide the tools to estimate it.
You have a million samples. You need ten. How do you pick them?
The first filter is the simplest and most powerful: run each sample against the public example tests from the problem statement. These are the small input/output pairs that Codeforces includes in every problem.
This single step removes approximately 99% of all samples. Most generated programs are wrong in obvious ways — they crash, produce the wrong output format, or get the logic fundamentally wrong. The example tests catch all of this.
| Model | % Problems with ≥1 Passing Sample | Avg % Samples Passing |
|---|---|---|
| 300M | 82.1% | 0.39% |
| 1B | 87.2% | 0.59% |
| 9B | 89.7% | 0.76% |
| 41B | 92.3% | 0.73% |
Two things to notice. First, even after filtering 99% of a million samples, you still have thousands of candidates per problem. Second, for roughly 8-18% of problems (depending on model size), not a single sample passes the example tests from a million attempts. These are the hardest problems — the ones the system simply cannot solve yet.
After filtering, you could randomly pick 10 samples from the remaining thousands. But this is wasteful. Many of the surviving samples are syntactically different but semantically identical — they produce the same outputs on all inputs. Randomly selecting from this pool will often give you ten copies of the same (possibly wrong) behavior.
What you want is diversity: ten submissions that represent ten different algorithmic approaches. If one approach is wrong, another might be right.
Why exactly 10? This mirrors real competitive programming. On Codeforces, each wrong submission incurs a time penalty. Allowing unlimited submissions would let the system brute-force the hidden judge. The 10-submission limit forces the selection method to be smart, not just prolific.
The metric 10@k means: generate k total samples, filter and select 10 for submission, and check if any of those 10 passes all hidden tests. The complementary metric pass@k assumes all k samples can be submitted (an upper bound). The gap between 10@k and pass@k reveals how much room remains for better selection methods.
Filtering on example tests is a necessary but insufficient step. The example tests are small and few (1-2 per problem on average). They catch blatantly wrong solutions but miss subtle bugs that only appear on larger inputs or edge cases. That is why you need the next step: clustering.
You have thousands of programs that pass the example tests. You need to pick 10 that are as different as possible. How do you define "different"?
Not by looking at the code. Two programs can look completely different (one uses recursion, the other uses iteration) but produce identical outputs on every input. Conversely, two programs that look nearly identical might differ on one edge case.
AlphaCode's answer: cluster programs by their behavior — specifically, by the outputs they produce on a set of test inputs.
The system needs additional test inputs beyond the public examples. It trains a separate test-input generation model using the same architecture as the main model, initialized from the same GitHub pre-trained checkpoint. This model is trained to predict test inputs from problem descriptions, using example, hidden, and generated test inputs as training data.
At test time, this model generates new inputs for unseen problems. These inputs are not guaranteed to be valid (especially for problems with complex input constraints), but even imperfect or invalid inputs are useful for distinguishing program behavior.
Clustering is not a minor optimization. It is the difference between a system that works and one that does not at high sample counts:
| Method (1B model) | 10@10k | 10@100k | 10@1M |
|---|---|---|---|
| No filtering | ~5% | ~5% | ~5% |
| Filtering only | ~14% | ~18% | ~21% |
| Filtering + clustering | ~18% | ~24% | ~28% |
| Upper bound (pass@k) | ~20% | ~29% | ~37% |
Without filtering, the solve rate is flat — more samples do not help at all. With filtering alone, it scales but slowly. Clustering closes a significant portion of the gap to the theoretical upper bound, especially at high sample counts where behavioral diversity matters most.
AlphaCode was evaluated in two settings: simulated live Codeforces contests and the CodeContests benchmark. Both tell a consistent story.
The system was evaluated on 10 recent Codeforces contests (December 2021), each with more than 5,000 participants. For each contest, AlphaCode generated samples for every problem, filtered and clustered them, and submitted the top candidates to the Codeforces judge.
| Metric | Value |
|---|---|
| Average ranking | Top 54.3% |
| Estimated Codeforces rating | 1238 |
| Percentile among active users | Top 28% |
| Average submissions per solved problem | 2.4 |
| Best single-contest ranking | Top 20.6% |
| Worst single-contest ranking | Top 95.7% |
There is high variance across contests. AlphaCode tends to solve the easier problems in each contest but occasionally solves problems rated as high as 1800 (well above its average skill level).
| System | 10@1k | 10@10k | 10@100k | 10@1M |
|---|---|---|---|---|
| AlphaCode 9B | 16.9% | 22.6% | 27.1% | 30.1% |
| AlphaCode 41B | 16.9% | 23.9% | 28.2% | 31.8% |
| 41B + clustering | 21.0% | 26.2% | 31.8% | 34.2% |
The 41B model consistently outperforms the 9B, and clustering consistently provides an improvement across all sample budgets. With 1M samples and clustering, AlphaCode solves over a third of all problems in the validation set.
A careful build-up ablation shows each enhancement contributes:
| Enhancement (cumulative) | 10@100k |
|---|---|
| No enhancements (base enc-dec) | 15.2% |
| + Masked language modeling (encoder) | 17.0% |
| + Tempering (T=0.2) | 18.7% |
| + Tag & rating conditioning | 19.3% |
| + Value conditioning & prediction | 20.2% |
| + GOLD training objective | 21.5% |
| + Clustering | 24.1% |
Starting from 15.2% and reaching 24.1%, each addition contributes a meaningful gain. Clustering alone adds nearly 3 percentage points — the single largest improvement besides scaling model size.
The headline result is impressive, but understanding how AlphaCode solves problems — and where it fails — reveals deeper insights.
| Problem Tag | Solve Rate (41B, 10@10k) | Relative Strength |
|---|---|---|
| Sortings | 33.8% | Strong |
| Greedy | 25.0% | Strong |
| Math | 28.2% | Strong |
| Bitmasks | 20.4% | Good |
| Brute Force | 15.7% | Moderate |
| Constructive | 14.9% | Moderate |
| Graphs | 16.5% | Moderate |
| DP | 8.8% | Weak |
| Data Structures | 13.6% | Moderate |
AlphaCode is strongest on problems that require greedy algorithms, sorting, and direct math — areas where a straightforward approach often works. It is weakest on dynamic programming, which requires planning a multi-step recurrence and is one of the hardest categories even for humans.
A natural concern: maybe AlphaCode just memorizes solutions from training. The authors conducted a thorough analysis:
A natural question: how often does the model even produce code that compiles? Python samples are mostly syntactically correct (the language is forgiving). C++ is harder, and smaller models produce more syntax errors. But this matters less than you might think — samples with syntax errors are trivially filtered out in the first stage, and the million-sample budget easily absorbs the loss.
Experiments modifying problem descriptions reveal that AlphaCode genuinely reads and depends on the full problem statement:
| Description Change | Solve Rate | Interpretation |
|---|---|---|
| Original | 17.1% | Baseline |
| Opposite problem | 0.1% | Model is not keyword-matching |
| Related but different | 3.2% | Understands problem structure |
| Verbose rewording | 19.4% | Robust to paraphrasing |
| Simplified (remove narrative) | 15.7% (from 3.0%) | Narrative is a real obstacle |
| 7 synonyms substituted | 12.5% | Minor impact |
| Remove spec section | 4.8% | I/O format is critical |
The "opposite problem" result (0.1%) is particularly telling. If the model were just matching keywords like "prime" to stored solutions about primes, it would still produce solutions. Instead, it correctly follows the (reversed) requirements and produces outputs that are wrong for the original problem — evidence of genuine comprehension.
One of the paper's more subtle findings: the validation language modeling loss starts increasing after ~50k fine-tuning steps (indicating overfitting), but the solve rate continues to improve. The hypothesis: the model reallocates probability mass from atypical solutions toward more typical ones. This hurts average loss but increases the chance that a sampled solution is correct.
AlphaCode introduced the "generate massively, then filter" paradigm for code generation. Since its publication, this idea has echoed across multiple follow-up systems.
Brown et al. generalized AlphaCode's core insight: given enough samples, even weaker models can solve hard problems. They showed that coverage (the probability that at least one sample is correct) scales as a sigmoid function of the log sample count, and that different problems have different "critical sample counts" where coverage transitions from near-zero to near-one. This work formalized what AlphaCode demonstrated empirically.
DeepMind's follow-up, powered by Gemini, reached top 15% on Codeforces (estimated rating ~1650). Key improvements: better base model (Gemini vs. custom transformer), fine-tuning with a policy-value approach, and a more sophisticated filtering pipeline.
While AlphaCode generates solutions from scratch and filters, AlphaEvolve takes a different approach: it evolves solutions using LLMs as mutation operators in an evolutionary loop. Instead of generating a million independent samples, AlphaEvolve iteratively improves a population of solutions. Where AlphaCode relies on brute-force sampling, AlphaEvolve relies on directed search.
Took AlphaCode's scaling insight further: by scaling inference-time compute (more samples, longer generations), they pushed frontier models to solve 57.4% of SWE-bench Verified problems. CodeMonkeys showed that AlphaCode's approach extends beyond competitive programming to real-world software engineering tasks.
Reinforcement Learning from Execution Feedback uses program execution results as the reward signal for RL fine-tuning. Where AlphaCode uses execution to filter at inference time, RLEF uses execution to improve the model at training time. The two approaches are complementary: a model trained with RLEF would need fewer samples to reach the same solve rate.
For all its success, AlphaCode has clear weaknesses:
AlphaCode's lasting contribution is not any single architectural choice. It is the demonstration that verification is the bottleneck, not generation. In domains where you can automatically check solutions — by running code, checking proofs, evaluating fitness functions — the right strategy is to generate at scale and let the verifier do the work. This insight now drives approaches across code generation, mathematical reasoning, and scientific discovery.