250 real PyTorch workloads, automated correctness + performance testing, and a sobering finding: even frontier reasoning models match the PyTorch baseline in fewer than 20% of cases overall.
You train a transformer model. Under the hood, PyTorch dispatches hundreds of GPU operations — matrix multiplies, softmax, layer norm, GELU activations. Each one is implemented as a kernel: a function that runs in parallel across thousands of GPU threads.
The quality of these kernels determines whether your training run takes 3 days or 3 weeks. A well-written kernel can be 10x faster than a naive one for the same mathematical operation. The difference comes down to how cleverly you use the GPU's memory hierarchy, parallelism, and specialized hardware like tensor cores.
Here's the problem: writing good GPU kernels is brutally hard. It requires deep knowledge of CUDA, GPU architecture, memory access patterns, thread scheduling, and hardware-specific instructions. FlashAttention, the attention kernel that made modern transformers practical, took 5 years to appear after the Transformer was published. Porting it to the next GPU generation (NVIDIA Hopper) took another 2 years.
KernelBench asks whether LLMs can take over this work. The question is not "can LLMs write CUDA code that compiles?" but something much harder: can LLMs write CUDA code that is both correct and faster than what PyTorch already provides?
PyTorch's built-in kernels aren't toy implementations. They call into cuBLAS, cuDNN, and other heavily optimized closed-source libraries that NVIDIA has been refining for years. Beating them means the LLM must discover genuine algorithmic improvements — operator fusion, memory tiling, exploiting sparsity — not just produce compilable code.
KernelBench is an open-source benchmark with 250 carefully selected PyTorch ML workloads. The design mirrors how a real kernel engineer works:
Input: a PyTorch Model class with __init__ and forward() methods, plus functions specifying input tensor shapes and dtypes.

Output: a ModelNew class that produces identical results but runs faster, using any optimization strategy the LLM chooses: CUDA, Triton, CUTLASS, or any other approach.

Here's what a typical task looks like. The LLM receives a PyTorch reference and must produce an optimized replacement:
Input: PyTorch reference (Python)

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    # Simple matrix multiply: C = A @ B
    def forward(self, A, B):
        return torch.matmul(A, B)

M, N, K = 256, 256, 131072

def get_inputs():
    # A is (M, K) and B is (K, N), so the result is (M, N)
    return [torch.randn(M, K), torch.randn(K, N)]
```
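Output: LLM-generated optimized kernel (Python + CUDA)

```python
import torch.nn as nn

# CUDA source for the kernel. This excerpt omits the host-side wrapper
# (matmul_cuda) and the extension build step that binds it to self.matmul.
matmul_source = """
__global__ void matmul_kernel(const float* A, const float* B, float* C,
                              int M, int N, int K) {
    // One thread computes one element of C.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
"""

class ModelNew(nn.Module):
    def forward(self, A, B):
        # self.matmul is the compiled extension module (setup not shown)
        return self.matmul.matmul_cuda(A, B)
```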
The benchmark is designed so that progress directly translates to real-world impact. Every task corresponds to an actual ML workload. If an LLM can beat the PyTorch baseline on a convolution kernel, that kernel can immediately be used in production training runs.
How do you score an LLM on kernel writing? You need to capture two things simultaneously: is the kernel correct, and is it fast? A kernel that produces wrong answers is useless no matter how fast it is. A kernel that's correct but slower than PyTorch is also useless.
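KernelBench introduces the fast_p metric. It measures the fraction of tasks where the LLM-generated kernel is both functionally correct and achieves a speedup greater than threshold p over the PyTorch baseline:

$$\text{fast}_p = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(\text{correct}_i \land \text{speedup}_i > p\right)$$

where $N$ is the number of tasks and $\text{speedup}_i = T_{\text{Model}} / T_{\text{ModelNew}}$, the ratio of PyTorch baseline time to generated kernel time.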
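The threshold p is the key knob. With p = 0, every correct kernel counts, so the metric reduces to the correctness rate. With p = 1, a kernel must also run faster than the PyTorch baseline. Larger values of p raise the bar to progressively bigger speedups.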
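To make the definition concrete, here is a minimal sketch of the metric in plain Python. The TaskResult class, its field names, and the four sample results are made up for illustration; they are not KernelBench's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical field names)."""
    correct: bool          # passed the correctness check
    baseline_ms: float     # runtime of the PyTorch reference
    generated_ms: float    # runtime of the generated kernel


def fast_p(results, p):
    """Fraction of tasks that are correct AND have speedup strictly above p."""
    if not results:
        raise ValueError("need at least one task result")
    hits = sum(
        1 for r in results
        if r.correct and (r.baseline_ms / r.generated_ms) > p
    )
    return hits / len(results)


# Made-up results covering the interesting cases.
results = [
    TaskResult(correct=True, baseline_ms=10.0, generated_ms=4.0),   # 2.5x faster
    TaskResult(correct=True, baseline_ms=10.0, generated_ms=20.0),  # correct but 2x slower
    TaskResult(correct=False, baseline_ms=10.0, generated_ms=1.0),  # fast but wrong
    TaskResult(correct=True, baseline_ms=10.0, generated_ms=8.0),   # 1.25x faster
]

print(fast_p(results, 0))  # correctness rate
print(fast_p(results, 1))  # correct and faster than the baseline
print(fast_p(results, 2))  # correct and more than 2x faster
```

Note that the fast-but-wrong kernel never counts, whatever the threshold, which is the point of combining both conditions in one number.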
The correctness check compares outputs on 5 randomly generated inputs. The performance check runs 100 timed iterations after 3 warmup runs, using CUDA events for precise GPU timing. The coefficient of variation is consistently below 3%, so the mean is a reliable measure.
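The two checks can be sketched in a few lines. This is a CPU-only stand-in under stated assumptions: plain Python lists replace tensors, time.perf_counter replaces CUDA events, and the tolerances (1e-2) are a guess for illustration, not values taken from the text above. The function names are hypothetical.

```python
import math
import random
import statistics
import time


def outputs_match(expected, actual, rtol=1e-2, atol=1e-2):
    """Element-wise closeness check; the tolerances are assumptions."""
    if len(expected) != len(actual):
        return False  # shape mismatch
    return all(
        math.isclose(e, a, rel_tol=rtol, abs_tol=atol)
        for e, a in zip(expected, actual)
    )


def check_correctness(reference, candidate, make_inputs, trials=5):
    """Compare both implementations on `trials` freshly generated random inputs."""
    return all(
        outputs_match(reference(*args), candidate(*args))
        for args in (make_inputs() for _ in range(trials))
    )


def time_function(fn, args, warmup=3, iters=100):
    """Return (mean seconds, coefficient of variation) over `iters` timed runs."""
    for _ in range(warmup):
        fn(*args)  # warmup runs are executed but not timed
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    cov = statistics.stdev(samples) / mean if mean > 0 else 0.0
    return mean, cov


def make_inputs():
    return ([random.gauss(0.0, 1.0) for _ in range(1000)],)


def reference(xs):
    return [2.0 * x for x in xs]


def candidate(xs):
    return [x + x for x in xs]


def broken(xs):
    return [2.0 * x + 1.0 for x in xs]


print(check_correctness(reference, candidate, make_inputs))  # equivalent implementations
print(check_correctness(reference, broken, make_inputs))     # off by one everywhere
mean, cov = time_function(candidate, make_inputs())
print(f"mean={mean:.2e}s  cov={cov:.1%}")
```

On a CPU the coefficient of variation will usually be well above the 3% reported for the GPU setup, which is one reason dedicated hardware and device-side timers matter for this kind of measurement.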
The 250 tasks are organized into three difficulty levels based on the number of primitive operations they contain. Each level tests a different aspect of kernel engineering:
Level 1 (single operations). Individual building blocks of AI: matrix multiplies, convolutions, activations (GELU, ReLU, Softmax), normalizations (LayerNorm, BatchNorm), and losses (CrossEntropy, MSE). These are the atoms of ML computation.
The catch: PyTorch already calls heavily optimized (often closed-source) kernels like cuBLAS for these. Beating the baseline means the LLM must match years of hand-tuned NVIDIA engineering. But if it succeeds, the resulting open-source kernels are immediately valuable.
Level 2 (operator sequences). Chains of 3-6 operations that commonly appear together: a convolution followed by ReLU and bias, or a matmul followed by softmax and dropout. The key opportunity here is operator fusion: combining multiple operations into a single kernel to avoid reading from and writing to slow global memory between each step.
Compiler tools like torch.compile already do some fusion automatically. The question is whether LLMs can discover more aggressive fusions than the compiler.
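A toy illustration of why fusion helps, in plain Python rather than CUDA. Each list comprehension in the unfused version stands in for a separate kernel that writes a full intermediate buffer; the fused version keeps intermediates in a local variable, which plays the role of a register. The scale-bias-ReLU chain and the traffic model are simplified assumptions chosen for the example.

```python
def unfused(xs, scale, bias):
    """Three separate 'kernels', each streaming a full buffer in and out."""
    scaled = [x * scale for x in xs]        # kernel 1: writes an intermediate
    shifted = [s + bias for s in scaled]    # kernel 2: reads it, writes another
    return [max(v, 0.0) for v in shifted]   # kernel 3: reads it, writes the output


def fused(xs, scale, bias):
    """One 'kernel': the intermediate value never leaves a local variable."""
    out = []
    for x in xs:
        v = x * scale + bias    # stays in a 'register'
        out.append(max(v, 0.0))
    return out


def buffer_traffic(n, num_kernels):
    """Elements read plus written if every kernel streams n elements in and out."""
    return 2 * n * num_kernels


xs = [-2.0, -0.5, 0.0, 1.5]
print(unfused(xs, 2.0, 1.0))
print(fused(xs, 2.0, 1.0))
print(buffer_traffic(len(xs), 3), "vs", buffer_traffic(len(xs), 1))
```

Under this simple model the fused version moves a third of the data for the same arithmetic. For memory-bound elementwise chains, that reduction in traffic is where most of the speedup comes from.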
Level 3 (full architectures). Complete ML models from popular GitHub repositories: AlexNet, MiniGPT, ResNet variants, transformer blocks. These require the LLM to identify which operations to optimize and how to compose optimizations across an entire architecture. This is the closest to real-world kernel engineering.
| Level | Tasks | Operations per Task | What It Tests | Baseline Challenge |
|---|---|---|---|---|
| L1 | 100 | 1 | Single-op kernel writing | cuBLAS, cuDNN |
| L2 | 100 | 3-6 | Operator fusion | torch.compile |
| L3 | 50 | Many | Full architecture optimization | Both + algorithmic |
Here's the complete pipeline that every LLM-generated kernel goes through. This is the machinery that turns "can an LLM write a kernel?" into a precise, repeatable measurement.
The evaluation runs on a bare-metal NVIDIA L40S GPU (Ada Lovelace, 48 GB GDDR6, 300W) with Python 3.10, PyTorch 2.5.0, and CUDA 12.4. Only one kernel runs at a time, so there is no interference from other CUDA processes.
There are three categories of failure, and they cascade:
Execution failure. The kernel doesn't even run. This includes CUDA/nvcc compile-time errors (syntax mistakes, undefined functions), CUDA memory violations (out-of-bounds access, unaligned memory), and Python runtime errors. Reasoning models like o1 and DeepSeek-R1 produce significantly fewer of these.

Correctness failure. The kernel runs but produces wrong outputs. This includes shape mismatches (output tensor has wrong dimensions) and value mismatches (numerically incorrect results). These are harder to fix because the error messages are less informative: "your output differs from reference" tells you less than "undefined variable on line 42."

Performance failure. The kernel is correct but slower than PyTorch. This is the most common outcome even for correct kernels: the LLM produces a valid implementation that misses the optimizations that make PyTorch's built-in kernels fast.
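The cascade can be written as a short classifier: each check only matters if the previous one passed. This is a sketch of the logic described above, not the benchmark's own code; the Outcome enum and the classify signature are invented for the example.

```python
from enum import Enum


class Outcome(Enum):
    EXECUTION_FAILURE = "did not compile or crashed"
    CORRECTNESS_FAILURE = "ran but produced wrong output"
    SLOWER_THAN_BASELINE = "correct but not faster"
    SUCCESS = "correct and faster"


def classify(ran_ok, outputs_ok, baseline_ms, generated_ms):
    """Apply the checks in order; later checks are skipped once one fails."""
    if not ran_ok:
        return Outcome.EXECUTION_FAILURE
    if not outputs_ok:
        return Outcome.CORRECTNESS_FAILURE
    if baseline_ms / generated_ms <= 1.0:
        return Outcome.SLOWER_THAN_BASELINE
    return Outcome.SUCCESS


# A kernel that crashed has no outputs or timings worth inspecting.
print(classify(False, False, 10.0, 10.0))
print(classify(True, False, 10.0, 2.0))
print(classify(True, True, 10.0, 12.0))
print(classify(True, True, 10.0, 5.0))
```

Only the last outcome counts toward the metric at p = 1.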
The paper evaluates seven frontier and open-source models in a one-shot setting: the LLM sees one example of a PyTorch-to-CUDA transformation (a simple add operator) and then must generate optimized kernels for each of the 250 tasks.
The headline finding is stark: averaged over all 250 tasks, even the best models match PyTorch's baseline speed on fewer than 20% of them.
| Model | L1 fast_1 | L2 fast_1 | L3 fast_1 | Type |
|---|---|---|---|---|
| GPT-4o | 4% | 5% | 0% | Standard |
| OpenAI o1 | 10% | 24% | 12% | Reasoning |
| DeepSeek V3 | 6% | 4% | 8% | Standard |
| DeepSeek R1 | 12% | 36% | 2% | Reasoning |
| Claude 3.5 Sonnet | 10% | 7% | 2% | Standard |
| Llama 3.1 70B | 3% | 0% | 0% | Open-source |
| Llama 3.1 405B | 3% | 0% | 2% | Open-source |
Reasoning models dominate. DeepSeek R1 and OpenAI o1 consistently outperform standard models. R1 achieves 36% on Level 2 — far ahead of any other model. The extended "thinking" these models do appears to help with the structured, multi-step reasoning that kernel optimization requires.
Level 2 is the sweet spot. R1 gets 36% on L2 but only 12% on L1 and 2% on L3. Level 2 tasks involve operator sequences where fusion is the natural optimization — and fusion is a well-documented pattern that LLMs have likely seen in training data. Level 1 tasks require beating cuBLAS (very hard). Level 3 tasks require holistic architecture optimization (too complex).
Open-source models struggle badly. Llama 3.1 70B scores 0% on both L2 and L3. Even the 405B variant barely improves. CUDA is only 0.073% of The Stack v1.2 training corpus — these models simply haven't seen enough kernel code.
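The per-level numbers can look at odds with the "fewer than 20%" headline, since R1 reaches 36% on Level 2. The arithmetic below weights each level by its task count (100, 100, 50) to get an overall figure per model. The percentages are copied from the table above; because they are rounded, the overall values are approximate.

```python
TASKS_PER_LEVEL = {"L1": 100, "L2": 100, "L3": 50}

# One-shot percentages from the results table (speedup threshold p = 1).
FAST_1_PERCENT = {
    "GPT-4o": {"L1": 4, "L2": 5, "L3": 0},
    "OpenAI o1": {"L1": 10, "L2": 24, "L3": 12},
    "DeepSeek V3": {"L1": 6, "L2": 4, "L3": 8},
    "DeepSeek R1": {"L1": 12, "L2": 36, "L3": 2},
    "Claude 3.5 Sonnet": {"L1": 10, "L2": 7, "L3": 2},
    "Llama 3.1 70B": {"L1": 3, "L2": 0, "L3": 0},
    "Llama 3.1 405B": {"L1": 3, "L2": 0, "L3": 2},
}


def overall_percent(per_level):
    """Task-weighted average of the per-level percentages."""
    total_tasks = sum(TASKS_PER_LEVEL.values())
    weighted = sum(per_level[lvl] * TASKS_PER_LEVEL[lvl] for lvl in TASKS_PER_LEVEL)
    return weighted / total_tasks


for model, per_level in FAST_1_PERCENT.items():
    print(f"{model:18s} {overall_percent(per_level):5.1f}%")
```

With these inputs the best overall score belongs to DeepSeek R1 at about 19.6% (roughly 49 of 250 tasks), which is how a 36% Level 2 result and a sub-20% headline can both hold.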
A human kernel engineer doesn't write a perfect kernel on the first try. They compile, check for errors, profile, and iterate. KernelBench's framework enables the same workflow for LLMs by feeding back three types of signals: the previous generation (G), compiler and execution results (E), and profiler output (P).
The paper compares two test-time strategies with a fixed budget of 10 inference calls:
Repeated sampling. Generate 10 independent kernel attempts with high temperature. Pick the best one (correct and fastest). This is a brute-force search with no learning between attempts.

Result: when the sample count is raised to 100, DeepSeek V3 goes from 4% to 37% on Level 2. But for tasks where the model has near-zero probability of generating a correct solution (like 34 convolution variants in L1), even 100 attempts change nothing.
Iterative refinement. Generate one kernel, feed back execution results and profiler data, let the LLM refine. Repeat for 10 turns. Each turn builds on the previous attempt's feedback.

Result: DeepSeek R1 with G+E+P feedback goes from 36% to 72% on Level 2, doubling its one-shot performance. On Level 1, it jumps from 12% to 43%.
| Method | R1 L1 | R1 L2 | R1 L3 |
|---|---|---|---|
| One-shot baseline | 12% | 36% | 2% |
| Iterative w/ G only | 18% | 44% | 4% |
| Iterative w/ G+E | 41% | 62% | 12% |
| Iterative w/ G+E+P | 43% | 72% | 18% |
Iterative refinement wins in 5 of 6 comparisons against repeated sampling at the same inference budget. But both methods are fundamentally limited by the base model's capability. Llama 70B barely improves with either approach, while R1 improves dramatically. The feedback helps, but only if the model can use it.
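The two strategies differ only in whether information flows from one attempt to the next. The sketch below shows that control flow with hypothetical generate and evaluate callables standing in for the LLM call and the evaluation pipeline. The toy stubs at the bottom are deliberately contrived so the loops can be run; they say nothing about which strategy wins in practice.

```python
def _is_better(result, best):
    return result["correct"] and (best is None or result["speedup"] > best["speedup"])


def repeated_sampling(generate, evaluate, budget=10):
    """Draw `budget` independent attempts and keep the fastest correct one."""
    best = None
    for _ in range(budget):
        kernel = generate(feedback=None)   # nothing carried over between attempts
        result = evaluate(kernel)
        if _is_better(result, best):
            best = result
    return best


def iterative_refinement(generate, evaluate, budget=10):
    """Feed each attempt and its evaluation into the next generation call."""
    best, feedback = None, None
    for _ in range(budget):
        kernel = generate(feedback=feedback)
        result = evaluate(kernel)
        if _is_better(result, best):
            best = result
        # Previous generation plus its execution/profiler results.
        feedback = {"previous": kernel, "result": result}
    return best


def toy_generate(feedback):
    """Hypothetical stand-in for an LLM: improves by one step per feedback round."""
    return 0 if feedback is None else feedback["previous"] + 1


def toy_evaluate(kernel):
    """Hypothetical evaluator: quality 3 or above counts as correct."""
    return {"correct": kernel >= 3, "speedup": 0.5 * kernel}


print(repeated_sampling(toy_generate, toy_evaluate))
print(iterative_refinement(toy_generate, toy_evaluate))
```

Both loops spend the same number of model calls, which is what makes the comparison at a fixed inference budget meaningful.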
Let's look at what LLMs actually do when they succeed — and why they fail when they fail.
The paper identifies several categories of genuine optimizations that LLMs discover:
Operator fusion. The most common success pattern. LLMs fuse GELU computation into a single kernel (2.9x speedup), fuse Softsign (1.3x), and fuse matmul + division + summation + scaling (2.6x). Fusion reduces the number of round-trips to slow global memory — instead of writing intermediate results and reading them back, everything happens in fast registers.
Memory hierarchy exploitation. Some kernels use shared memory effectively: cosine similarity (2.8x) and triplet margin loss (2.0x). Shared memory is on-chip and roughly 100x faster than global memory, but it is small: by default a thread block can use at most 48 KB. Using it well requires careful tiling.
Algorithmic insight. The most impressive case: for a diagonal matrix multiplied by a dense matrix, one model recognized that it can simply scale each row of the dense matrix by the corresponding diagonal element instead of performing a full matmul. This yields a 13x speedup by reducing O(n³) to O(n²).
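The diagonal-matrix shortcut is easy to verify on a small case. The sketch below assumes the diagonal matrix is the left operand, so the product scales rows; nested Python lists stand in for tensors, and the function names are invented for the example.

```python
def diag_matmul_naive(d, B):
    """Materialize diag(d) and run a full matrix multiply."""
    n, m = len(d), len(B[0])
    D = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return [
        [sum(D[i][k] * B[k][j] for k in range(n)) for j in range(m)]
        for i in range(n)
    ]


def diag_matmul_scaled(d, B):
    """Scale row i of B by d[i]; no multiplications by zero are performed."""
    return [[d[i] * b for b in row] for i, row in enumerate(B)]


d = [2.0, -1.0, 0.5]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(diag_matmul_naive(d, B))
print(diag_matmul_scaled(d, B))
```

For an n x n dense matrix the naive version performs n multiply-adds per output element, almost all of them against zeros, while the scaled version performs one multiplication per element.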
Execution failures dominate. For standard models (GPT-4o, Claude, Llama), over 70% of generated kernels have execution errors. Even reasoning models fail on >45%. The most common: CUDA syntax errors, incorrect thread/block dimension calculations, and memory access violations.
Correctness is the hard ceiling. After 10 turns of refinement, R1 can compile-and-run >90% of kernels. But functional correctness remains stuck — many kernels produce numerically wrong results. The error messages for correctness failures ("output mismatch") are much less actionable than compile errors.
Tensor core instructions are a black hole. When provided with hardware information, R1 attempts to use warp matrix multiply-accumulate (wmma) instructions for ~50% of matrix multiply tasks. Almost none compile successfully. Tensor cores are critical for real-world performance but their programming model is extremely specific — wrong alignment, wrong data types, or wrong tile sizes cause silent failures.
Hardware generalization is poor. Kernels that beat PyTorch on L40S may not beat it on A10G or H100. R1's L2 fast_1 varies from 36% (L40S) to 47% (A10G): the same kernels perform differently on different hardware. Optimal kernels are hardware-specific.
Let's put the KernelBench results in context. The gap between what LLMs produce and what expert kernel engineers achieve is enormous — and understanding why illuminates what LLMs need to improve.
Tensor core utilization. FlashAttention achieves 50-75% of peak GPU throughput by carefully orchestrating tensor core wmma instructions with precise memory layouts. LLMs attempt wmma but almost never get the alignment, tile sizes, and data flow right. Tensor cores are the single biggest lever for performance on modern GPUs, and LLMs can't use them.
Asynchronous execution. Expert kernels overlap computation with memory transfers using CUDA streams and asynchronous copies. This hides memory latency — while one warp computes on data in shared memory, another warp is loading the next tile from global memory. No LLM-generated kernel in the paper uses this technique.
Hardware-aware tiling. The optimal tile size depends on the specific GPU: how much shared memory per SM, how many registers per thread, the warp size. Expert kernels are tuned for the target hardware. When LLMs were given hardware specs (H100 memory sizes, bandwidths, TFLOPS), they rarely produced hardware-specific code. A few o1 and R1 kernels used hardware info to generate 2x+ speedups, but these were outliers.
Recomputation strategies. Sometimes it's faster to recompute an intermediate value than to store and reload it from memory. FlashAttention does this with the softmax normalization factor. This requires understanding the compute-vs-memory tradeoff specific to the hardware — a level of reasoning LLMs haven't demonstrated.
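Of the techniques above, tiling is the easiest to show in a few lines. The sketch below is a blocked matrix multiply in plain Python: the loop structure is the same one a CUDA kernel would use, with the innermost three loops working on small sub-blocks that would be staged in shared memory on a GPU. The tile parameter is the knob that hardware-aware tuning would adjust; the default of 2 is arbitrary.

```python
def matmul_naive(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [
        [sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
        for i in range(n)
    ]


def matmul_tiled(A, B, tile=2):
    """Blocked matmul: each (i0, j0, p0) step touches only tile-sized sub-blocks."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # On a GPU, these sub-blocks of A and B would sit in shared memory
                # and be reused by every thread in the block.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


A = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(matmul_tiled(A, B, tile=2))
```

The min(...) bounds handle matrix sizes that are not a multiple of the tile size. In Python the tiling changes nothing about speed; on a GPU the right tile size depends on how much shared memory and how many registers the target device offers.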
When given three carefully chosen examples (GeLU fusion, tiled matmul, minimal FlashAttention), LLMs attempt more aggressive optimizations. OpenAI o1's generations are 25% longer on average. On 77% of GEMM variants, o1 applies tiling (though still slower than PyTorch due to missing tensor core utilization). On 11 Level 2 problems, o1 uses shared memory I/O management to beat PyTorch Eager.
But there's a catch: more ambitious code means more bugs. Overall fast_1 actually decreases with few-shot examples because the LLMs attempt optimizations they can't implement correctly. Among correct solutions, quality improves. But the error rate goes up.
The paper identifies three directions:
KernelBench sits at the intersection of two rapidly evolving fields: LLM code generation and GPU kernel optimization. Here's how it connects to the broader landscape.
| Benchmark | Focus | KernelBench Difference |
|---|---|---|
| HumanEval / MBPP | Algorithmic correctness | KB tests wall-clock performance, not just correctness |
| SWE-bench | Real GitHub issues | KB focuses on performance optimization, not bug fixing |
| RE-Bench | R&D capabilities | KB is domain-specific to GPU kernels |
| ECCO | Algorithmic efficiency | KB measures wall-clock time, not asymptotic complexity |
AlphaEvolve (DeepMind, 2025) uses LLMs in an evolutionary loop to discover novel algorithms. Applied to GPU kernels, this approach could combine the search capabilities of evolution with LLMs' ability to propose plausible code modifications.
CodeMonkeys and similar agentic systems give LLMs the ability to iteratively compile, test, and profile code in a sandbox. KernelBench's iterative refinement results (36% → 72%) suggest this direction is promising.
RLEF (Reinforcement Learning from Execution Feedback) trains models to use execution results as reward signals. KernelBench's automated evaluation pipeline is a natural fit for this approach — correctness and speedup can be directly used as reward.
Triton and higher-level DSLs are becoming more popular for kernel writing. Future KernelBench evaluations could allow LLMs to target Triton instead of raw CUDA, potentially making the benchmark more tractable while maintaining real-world relevance.
KernelBench asks a question that goes beyond benchmarks: can LLMs become effective tools for systems-level programming where correctness and performance both matter? The current answer is "not yet, but the trajectory is promising." Iterative refinement doubles or triples one-shot performance. Reasoning models significantly outperform standard ones. And the few successful optimizations (13x diagonal matrix, 2.9x GELU fusion) show that LLMs can discover genuine algorithmic insights.
The bottleneck isn't intelligence — it's experience. LLMs haven't seen enough CUDA, haven't practiced enough kernel debugging, and don't have enough exposure to the optimization patterns that human experts internalize over years. Closing that gap — through better training data, better abstractions, and better feedback loops — is the path to LLMs that can actually write efficient GPU kernels.