Ouyang, Guo, Arora et al. — Stanford / Princeton, 2025

KernelBench: Can LLMs Write Efficient GPU Kernels?

250 real PyTorch workloads, automated correctness + performance testing, and a sobering finding: even frontier reasoning models match baseline PyTorch in fewer than 20% of cases.

Prerequisites: Basic GPU programming intuition + What PyTorch does under the hood
10 Chapters · 4+ Simulations

Chapter 0: The Problem

You train a transformer model. Under the hood, PyTorch dispatches hundreds of GPU operations — matrix multiplies, softmax, layer norm, GELU activations. Each one is implemented as a kernel: a function that runs in parallel across thousands of GPU threads.

The quality of these kernels determines whether your training run takes 3 days or 3 weeks. A well-written kernel can be 10x faster than a naive one for the same mathematical operation. The difference comes down to how cleverly you use the GPU's memory hierarchy, parallelism, and specialized hardware like tensor cores.

Here's the problem: writing good GPU kernels is brutally hard. It requires deep knowledge of CUDA, GPU architecture, memory access patterns, thread scheduling, and hardware-specific instructions. FlashAttention, the attention kernel that made modern transformers practical, took 5 years to appear after the Transformer was published. Porting it to the next GPU generation (NVIDIA Hopper) took another 2 years.

The bottleneck is human expertise. There's a Cambrian explosion of new ML architectures — Mamba, RWKV, mixture-of-experts — but each one needs custom kernels to run efficiently. The kernel engineers can't keep up. What if language models could write these kernels automatically?

This is the question KernelBench asks. Not "can LLMs write CUDA code that compiles?" but something much harder: can LLMs write CUDA code that is both correct AND faster than what PyTorch already provides?

PyTorch's built-in kernels aren't toy implementations. They call into cuBLAS, cuDNN, and other heavily optimized closed-source libraries that NVIDIA has been refining for years. Beating them means the LLM must discover genuine algorithmic improvements — operator fusion, memory tiling, exploiting sparsity — not just produce compilable code.

Why is writing efficient GPU kernels so challenging that it motivates using LLMs?

Chapter 1: The Key Insight

KernelBench is an open-source benchmark with 250 carefully selected PyTorch ML workloads. The design mirrors how a real kernel engineer works:

  1. Input: A PyTorch Model class with __init__ and forward() methods, plus functions specifying input tensor shapes and dtypes.
  2. Output: A new ModelNew class that produces identical results but runs faster, using any optimization strategy the LLM chooses — CUDA, Triton, CUTLASS, or any other approach.
  3. Evaluation: Automated correctness checking (compare outputs on random inputs) and wall-clock performance measurement.

Here's what a typical task looks like. The LLM receives a PyTorch reference and must produce an optimized replacement:

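Input: PyTorch Reference (python)
import torch
import torch.nn as nn

class Model(nn.Module):
    # Simple matrix multiply: C = A @ B
    def forward(self, A, B):
        return torch.matmul(A, B)

M, N, K = 256, 256, 131072

def get_inputs():
    # A has shape (M, K), B has shape (K, N)
    return [torch.randn(M, K), torch.randn(K, N)]

Output: LLM-Generated Kernel (python + CUDA)
from torch.utils.cpp_extension import load_inline

matmul_source = """
#include <torch/extension.h>
__global__ void matmul_kernel(const float* A,
    const float* B, float* C, int M, int N, int K) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < M && col < N) {
    float sum = 0.0f;
    for (int k = 0; k < K; ++k)
      sum += A[row*K + k] * B[k*N + col];
    C[row*N + col] = sum;
  }
}
// Minimal host wrapper (a sketch): one thread per output element.
// Assumes contiguous float32 CUDA tensors.
torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B) {
  int M = A.size(0), K = A.size(1), N = B.size(1);
  auto C = torch::empty({M, N}, A.options());
  dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
  matmul_kernel<<<grid, block>>>(A.data_ptr<float>(),
      B.data_ptr<float>(), C.data_ptr<float>(), M, N, K);
  return C;
}
"""
matmul = load_inline(
    name="matmul",
    cpp_sources="torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B);",
    cuda_sources=matmul_source, functions=["matmul_cuda"])

class ModelNew(nn.Module):
    def __init__(self):
        super().__init__()
        self.matmul = matmul  # compiled extension module
    def forward(self, A, B):
        return self.matmul.matmul_cuda(A, B)
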
The naive kernel above is actually slower than PyTorch. It doesn't use tiling, shared memory, or tensor cores. A real win requires the LLM to discover these optimizations on its own — that's what makes KernelBench hard. The LLM isn't just translating PyTorch to CUDA; it must engineer a faster implementation.
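
To see why the missing tiling matters, a back-of-the-envelope count of global-memory loads for the task above can help. This is a rough sketch, not a measurement: the tile size of 32 and the idealized reuse model (every element brought into shared memory is reused by a full tile's worth of threads) are illustrative assumptions, not numbers from the paper.

```python
# Rough count of global-memory loads for C = A @ B with A (M x K), B (K x N).
# TILE = 32 is an illustrative assumption, not a value from the paper.
M, N, K = 256, 256, 131072
TILE = 32

def naive_loads(m, n, k):
    # Each of the m*n threads reads k elements of A and k elements of B.
    return 2 * m * n * k

def tiled_loads(m, n, k, tile):
    # Each block loads 2 * tile * tile elements per k-step of width `tile`,
    # so total global traffic shrinks by a factor of `tile`.
    return 2 * m * n * k // tile

naive = naive_loads(M, N, K)
tiled = tiled_loads(M, N, K, TILE)
print(f"naive: {naive:,} loads, tiled: {tiled:,} loads, ratio: {naive // tiled}x")
```

Under these assumptions the tiled kernel issues 32 times fewer global loads, which is the kind of gap the naive kernel leaves on the table before tensor cores even enter the picture.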

The benchmark is designed so that progress directly translates to real-world impact. Every task corresponds to an actual ML workload. If an LLM can beat the PyTorch baseline on a convolution kernel, that kernel can immediately be used in production training runs.

What does the LLM receive as input in a KernelBench task?

Chapter 2: The fast_p Metric

How do you score an LLM on kernel writing? You need to capture two things simultaneously: is the kernel correct, and is it fast? A kernel that produces wrong answers is useless no matter how fast it is. A kernel that's correct but slower than PyTorch is also useless.

KernelBench introduces the fast_p metric. It measures the fraction of tasks where the LLM-generated kernel is both functionally correct and achieves a speedup greater than threshold p over the PyTorch baseline:

$$\text{fast}_p = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(\text{correct}_i \wedge \text{speedup}_i > p\right)$$

where $\text{speedup}_i = T_{\text{Model}} / T_{\text{ModelNew}}$ is the ratio of PyTorch baseline time to generated kernel time on task $i$, and $N$ is the number of tasks.

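The threshold p is the key knob: p = 0 counts every correct kernel regardless of speed (a pure correctness rate), p = 1 counts only correct kernels that are faster than PyTorch, and p > 1 raises the bar to a speedup larger than p.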

Why both axes matter: In the paper's experiments, many models produce kernels that are "almost correct": they compile and run but produce slightly wrong numerical outputs. Other kernels are perfectly correct but 10x slower because the LLM wrote a naive loop instead of using tiling. fast_p captures this by requiring BOTH correctness and performance in a single number.
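
The metric is simple enough to write down directly. The following is a minimal sketch of the definition above; the `KernelResult` record and its field names are hypothetical and are not taken from the KernelBench codebase.

```python
from dataclasses import dataclass

@dataclass
class KernelResult:
    # Hypothetical per-task record; field names are illustrative.
    correct: bool
    t_model: float      # baseline PyTorch time (any consistent unit)
    t_model_new: float  # generated kernel time (same unit)

def fast_p(results, p):
    """Fraction of tasks that are correct AND have speedup > p."""
    if not results:
        return 0.0
    hits = 0
    for r in results:
        speedup = r.t_model / r.t_model_new
        if r.correct and speedup > p:
            hits += 1
    return hits / len(results)

results = [
    KernelResult(True, 1.0, 0.5),   # correct, 2x faster
    KernelResult(True, 1.0, 2.0),   # correct, 2x slower
    KernelResult(False, 1.0, 0.1),  # fast but wrong
    KernelResult(True, 1.0, 0.8),   # correct, 1.25x faster
]
for p in (0.0, 1.0, 1.5):
    print(p, fast_p(results, p))
```

Note how the fast-but-wrong kernel never counts, and how raising p from 0 to 1.5 progressively filters out kernels that are correct but not fast enough.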

The correctness check compares outputs on 5 randomly generated inputs. The performance check runs 100 timed iterations after 3 warmup runs, using CUDA events for precise GPU timing. The coefficient of variation is consistently below 3%, so the mean is a reliable measure.
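
The timing protocol (discard warmup runs, time many trials, report summary statistics and the coefficient of variation) can be sketched as follows. This is an illustration only: it uses CPU wall-clock time from the standard library, whereas the benchmark records CUDA events on the GPU, and the function name `benchmark` is an assumption.

```python
import statistics
import time

def benchmark(fn, warmup=3, trials=100):
    """Time fn() with warmup runs discarded; returns summary stats in ms.

    Sketch only: uses CPU wall-clock time. A GPU harness would record
    CUDA events and synchronize before reading them.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)
    return {
        "mean": mean,
        "std": std,
        "min": min(samples),
        "max": max(samples),
        "cv": std / mean if mean > 0 else float("nan"),  # coefficient of variation
        "n": len(samples),
    }

stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"mean={stats['mean']:.3f} ms, cv={stats['cv']:.1%}")
```

A low coefficient of variation (std divided by mean) is what justifies summarizing each kernel by its mean time.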

What does fast_1 = 12% mean for a given model?

Chapter 3: Workload Design

The 250 tasks are organized into three difficulty levels based on the number of primitive operations they contain. Each level tests a different aspect of kernel engineering:

Level 1 — Single Operations (100 tasks)

Individual building blocks of AI: matrix multiplies, convolutions, activations (GELU, ReLU, Softmax), normalizations (LayerNorm, BatchNorm), and losses (CrossEntropy, MSE). These are the atoms of ML computation.

The catch: PyTorch already calls heavily optimized (often closed-source) kernels like cuBLAS for these. Beating the baseline means the LLM must match years of hand-tuned NVIDIA engineering. But if it succeeds, the resulting open-source kernels are immediately valuable.

Level 2 — Operator Sequences (100 tasks)

Chains of 3-6 operations that commonly appear together: a convolution followed by ReLU and bias, or a matmul followed by softmax and dropout. The key opportunity here is operator fusion — combining multiple operations into a single kernel to avoid reading from and writing to slow global memory between each step.

Compiler tools like torch.compile already do some fusion automatically. The question is whether LLMs can discover more aggressive fusions than the compiler.
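
The memory-traffic argument for fusion can be made concrete with a toy model. The sketch below is plain Python rather than GPU code: it runs a hypothetical scale, bias, ReLU chain both ways and counts element reads and writes, treating each list as if it lived in global memory. The operation chain and the traffic model are illustrative assumptions.

```python
def unfused(xs, scale, bias):
    """Three passes; each reads a full array and writes a full array."""
    traffic = 0
    scaled = [x * scale for x in xs]
    traffic += 2 * len(xs)          # read xs, write scaled
    shifted = [s + bias for s in scaled]
    traffic += 2 * len(xs)          # read scaled, write shifted
    out = [max(v, 0.0) for v in shifted]
    traffic += 2 * len(xs)          # read shifted, write out
    return out, traffic

def fused(xs, scale, bias):
    """One pass; intermediates stay in local variables ("registers")."""
    out = [max(x * scale + bias, 0.0) for x in xs]
    return out, 2 * len(xs)         # read xs, write out

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(unfused(xs, 2.0, 1.0))
print(fused(xs, 2.0, 1.0))
```

Both versions compute the same values, but the fused one touches memory a third as often. For memory-bound elementwise chains, that ratio is roughly the speedup fusion can offer.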

Level 3 — Full Architectures (50 tasks)

Complete ML models from popular GitHub repositories: AlexNet, MiniGPT, ResNet variants, transformer blocks. These require the LLM to identify which operations to optimize and how to compose optimizations across an entire architecture. This is the closest to real-world kernel engineering.

Why these specific tasks? Every task maps to real ML workloads. The Level 1 tasks cover the operators that consume >90% of GPU time in typical training runs. Level 2 captures the most common fusion patterns. Level 3 includes architectures people actually deploy. Success on KernelBench means faster real training runs, not just benchmark points.
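
| Level | Tasks | Operations per Task | What It Tests | Baseline Challenge |
|-------|-------|---------------------|---------------|--------------------|
| L1 | 100 | 1 | Single-op kernel writing | cuBLAS, cuDNN |
| L2 | 100 | 3-6 | Operator fusion | torch.compile |
| L3 | 50 | Many | Full architecture optimization | Both + algorithmic |
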
Why is Level 2 (operator sequences) a particularly interesting test for LLMs?

Chapter 4: Evaluation Framework

Here's the complete pipeline that every LLM-generated kernel goes through. This is the machinery that turns "can an LLM write a kernel?" into a precise, repeatable measurement.

  1. PyTorch Model + Task Spec: a reference Model class with forward(), plus get_inputs() and get_init_inputs() specifying tensor shapes and types.
  2. LLM Generation: the LLM receives Model plus a one-shot example and outputs ModelNew with custom CUDA/Triton code, using greedy decoding.
  3. Compilation: PyTorch's load_inline compiles the inline CUDA with nvcc; compiler errors are caught here.
  4. Correctness Check: run Model and ModelNew on 5 random inputs and compare the output tensors for equivalence.
  5. Performance Benchmark: 3 warmup iterations, then 100 timed runs using CUDA events; record mean, std, min, and max.
  6. Compute fast_p: speedup = T_Model / T_ModelNew. If the kernel is correct AND speedup > p, count the task as a success.
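
The correctness step of the pipeline above can be sketched in a few lines. This is a simplified stand-in that compares flat Python lists rather than tensors; the tolerance values and the helper names (`allclose`, `check_correctness`, `make_inputs`) are assumptions for illustration.

```python
import random

def allclose(a, b, atol=1e-2, rtol=1e-2):
    # Elementwise |a - b| <= atol + rtol * |b|; tolerances are assumptions.
    return len(a) == len(b) and all(
        abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b)
    )

def check_correctness(reference, candidate, make_inputs, trials=5, seed=0):
    """Run both implementations on `trials` random inputs and compare."""
    rng = random.Random(seed)
    for _ in range(trials):
        inputs = make_inputs(rng)
        if not allclose(candidate(*inputs), reference(*inputs)):
            return False
    return True

def make_inputs(rng):
    return ([rng.uniform(-1.0, 1.0) for _ in range(16)],)

def reference(xs):
    return [2.0 * x for x in xs]

def good(xs):
    return [x + x for x in xs]          # same math, different form

def bad(xs):
    return [2.0 * x + 0.5 for x in xs]  # off by a constant

print(check_correctness(reference, good, make_inputs))
print(check_correctness(reference, bad, make_inputs))
```

The length check doubles as a crude shape check, so a candidate that returns the wrong number of elements fails the same way a value mismatch does.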

The evaluation runs on a bare-metal NVIDIA L40S GPU (Ada Lovelace, 48 GB GDDR6, 300W) with Python 3.10, PyTorch 2.5.0, and CUDA 12.4. Only one kernel runs at a time, so there is no interference from other CUDA processes.

Why bare metal matters: GPU timing is notoriously noisy in cloud environments where multiple tenants share hardware. Running on bare metal with exclusive GPU access ensures that timing measurements reflect actual kernel performance, not scheduling noise. The <3% coefficient of variation confirms this setup produces reliable measurements.

There are three categories of failure, and they cascade:

Failure Category 1: Execution Errors

The kernel doesn't even run. This includes CUDA/nvcc compile-time errors (syntax mistakes, undefined functions), CUDA memory violations (out-of-bounds access, unaligned memory), and Python runtime errors. Reasoning models like o1 and DeepSeek-R1 produce significantly fewer of these.

Failure Category 2: Correctness Errors

The kernel runs but produces wrong outputs. This includes shape mismatches (output tensor has wrong dimensions) and value mismatches (numerically incorrect results). These are harder to fix because the error messages are less informative — "your output differs from reference" tells you less than "undefined variable on line 42."

Failure Category 3: Performance Failures

The kernel is correct but slower than PyTorch. This is the most common outcome even for correct kernels — the LLM produces a valid implementation that misses the optimizations that make PyTorch's built-in kernels fast.
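
Because the categories cascade, each kernel lands in exactly one bucket: the first stage it fails. A minimal sketch of that classification, with hypothetical argument and label names:

```python
def classify(compiled, ran, correct, speedup, p=1.0):
    """Return the first failure stage a kernel hits, or 'success'.

    Argument and label names are illustrative, not from the benchmark code.
    """
    if not (compiled and ran):
        return "execution_error"        # Category 1: never produced output
    if not correct:
        return "correctness_error"      # Category 2: wrong shape or values
    if speedup <= p:
        return "performance_failure"    # Category 3: correct but not fast enough
    return "success"

print(classify(False, False, False, 0.0))
print(classify(True, True, False, 3.0))
print(classify(True, True, True, 0.4))
print(classify(True, True, True, 1.7))
```

The ordering matters: a kernel that is fast but wrong is reported as a correctness error, never as a success.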

In the KernelBench evaluation pipeline, what are the three cascading failure categories?

Chapter 5: Model Results

The paper evaluates seven frontier and open-source models in a one-shot setting: the LLM sees one example of a PyTorch-to-CUDA transformation (a simple add operator) and then must generate optimized kernels for each of the 250 tasks.

The headline finding is stark: taken over all 250 tasks, even the best models match PyTorch's baseline speed on fewer than 20% of tasks, although scores on individual levels vary widely.

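| Model | L1 fast_1 | L2 fast_1 | L3 fast_1 | Type |
|-------|-----------|-----------|-----------|------|
| GPT-4o | 4% | 5% | 0% | Standard |
| OpenAI o1 | 10% | 24% | 12% | Reasoning |
| DeepSeek V3 | 6% | 4% | 8% | Standard |
| DeepSeek R1 | 12% | 36% | 2% | Reasoning |
| Claude 3.5 Sonnet | 10% | 7% | 2% | Standard |
| Llama 3.1 70B | 3% | 0% | 0% | Open-source |
| Llama 3.1 405B | 3% | 0% | 2% | Open-source |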
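
One way to reconcile the "fewer than 20%" headline with per-level scores as high as 36% is to pool the levels by task count. The sketch below does that arithmetic for the two reasoning models; the pooling itself is an assumption about how the headline number is aggregated, not something the table states.

```python
# Per-level fast_1 (in percent) from the table, and tasks per level.
TASKS = {"L1": 100, "L2": 100, "L3": 50}
FAST_1_PCT = {
    "OpenAI o1": {"L1": 10, "L2": 24, "L3": 12},
    "DeepSeek R1": {"L1": 12, "L2": 36, "L3": 2},
}

def pooled_pct(scores):
    # Weight each level's percentage by its number of tasks.
    weighted = sum(scores[level] * TASKS[level] for level in TASKS)
    return weighted / sum(TASKS.values())

for model, scores in FAST_1_PCT.items():
    print(f"{model}: {pooled_pct(scores):.1f}% of all 250 tasks")
```

Under this weighting, R1's strong Level 2 result is diluted by its weak Level 1 and Level 3 scores, and both models land below 20% overall.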

Key observations

Reasoning models lead. DeepSeek R1 and OpenAI o1 post the top score at every level (R1 on L1 and L2, o1 on L3), though not uniformly: Claude 3.5 Sonnet ties o1 on L1, and DeepSeek V3 beats R1 on L3. R1 achieves 36% on Level 2, far ahead of any other model. The extended "thinking" these models do appears to help with the structured, multi-step reasoning that kernel optimization requires.

Level 2 is the sweet spot. R1 gets 36% on L2 but only 12% on L1 and 2% on L3. Level 2 tasks involve operator sequences where fusion is the natural optimization — and fusion is a well-documented pattern that LLMs have likely seen in training data. Level 1 tasks require beating cuBLAS (very hard). Level 3 tasks require holistic architecture optimization (too complex).

Open-source models struggle badly. Llama 3.1 70B scores 0% on both L2 and L3. Even the 405B variant barely improves. CUDA is only 0.073% of The Stack v1.2 training corpus — these models simply haven't seen enough kernel code.

The sobering reality: Most LLM-generated kernels are not just slower; they're wrong. Looking at fast_0 (correctness only, ignoring speed), even the best models produce correct kernels for fewer than 50% of tasks. The majority of failures are execution errors: the code fails to compile or crashes at runtime. Writing CUDA that compiles, runs correctly, AND beats PyTorch is a very high bar.

Which model achieves the highest fast_1 on Level 2 tasks, and why might Level 2 be its strongest category?

Chapter 6: Iterative Refinement

A human kernel engineer doesn't write a perfect kernel on the first try. They compile, check for errors, profile, and iterate. KernelBench's framework enables the same workflow for LLMs by feeding back three types of signals: the previous generation (G), compiler and execution feedback (E), and profiler output (P).

The paper compares two test-time strategies with a fixed budget of 10 inference calls:

Repeated Sampling

Generate k independent kernel attempts at high temperature and pick the best one (correct and fastest). The fixed-budget comparison uses k = 10; the paper also scales k up to 100. This is a brute-force search with no learning between attempts.

Result: with 100 samples, DeepSeek V3 goes from 4% to 37% on Level 2. But for tasks where the model has near-zero probability of generating a correct solution (like 34 convolution variants in L1), even 100 attempts change nothing.
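
The reason repeated sampling helps on some tasks and not at all on others follows from simple probability. Assuming, as a simplification, that samples are independent and each succeeds with a fixed per-task probability q, the chance that at least one of k samples succeeds is 1 - (1 - q)^k:

```python
def success_within_k(q, k):
    """P(at least one of k independent samples succeeds), per-sample rate q.

    Independence and a fixed q per task are simplifying assumptions.
    """
    return 1.0 - (1.0 - q) ** k

for q in (0.0, 0.01, 0.05, 0.30):
    row = [round(success_within_k(q, k), 3) for k in (1, 10, 100)]
    print(f"q={q:<5} k=1,10,100 -> {row}")
```

A task with a small but nonzero q benefits enormously from more samples, while a task with q = 0 stays at zero no matter how large k gets, which matches the pattern described for the convolution variants.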

Iterative Refinement

Generate one kernel, feed back execution results and profiler data, let the LLM refine. Repeat for 10 turns. Each turn builds on the previous attempt's feedback.

Result: DeepSeek R1 with G+E+P feedback goes from 36% to 72% on Level 2 — doubling its one-shot performance. On Level 1, it jumps from 12% to 43%.
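
The refinement procedure is a feedback loop around the model. The sketch below shows its shape under stated assumptions: `generate` stands in for an LLM call and `evaluate` for the compile, check, and benchmark pipeline. Both are hypothetical callables, and the stubs at the bottom exist only to make the example runnable.

```python
def refine(generate, evaluate, turns=10):
    """Generic refinement loop (a sketch, not the paper's implementation).

    generate(history) -> kernel source (hypothetical LLM call)
    evaluate(kernel)  -> dict with 'correct' (bool), 'speedup' (float),
                         and 'feedback' (str: errors, timing, profile)
    """
    history, best = [], None
    for _ in range(turns):
        kernel = generate(history)
        result = evaluate(kernel)
        # Keep the previous generation plus its feedback for the next turn.
        history.append((kernel, result["feedback"]))
        if result["correct"] and (best is None or result["speedup"] > best[1]):
            best = (kernel, result["speedup"])
    return best, history

# Stub demo: attempt n becomes correct from n = 2 onward and gets faster.
def fake_generate(history):
    return f"kernel_v{len(history)}"

def fake_evaluate(kernel):
    n = int(kernel.rsplit("v", 1)[1])
    return {"correct": n >= 2, "speedup": 0.5 * n, "feedback": f"attempt {n}"}

best, history = refine(fake_generate, fake_evaluate, turns=5)
print(best)
```

Unlike repeated sampling, each call here sees the accumulated history, so the budget of 10 calls is spent sequentially rather than in parallel.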

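| Method | R1 L1 | R1 L2 | R1 L3 |
|--------|-------|-------|-------|
| One-shot baseline | 12% | 36% | 2% |
| Iterative w/ G only | 18% | 44% | 4% |
| Iterative w/ G+E | 41% | 62% | 12% |
| Iterative w/ G+E+P | 43% | 72% | 18% |
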
Execution feedback is the game-changer. Adding E (execution results) on top of G lifts R1's L2 score from 44% to 62%, and adding P (profiler output) contributes another 10 points, reaching 72%. The model can fix compilation errors and memory violations when it sees the error messages. But correctness errors are harder to fix: "your output differs" is much less actionable than "undefined variable on line 42." After 10 turns of refinement, R1 produces a functional kernel on >90% of L1 and L2 tasks, but most remaining failures are correctness issues, not execution errors.

Iterative refinement wins in 5 of 6 comparisons against repeated sampling at the same inference budget. But both methods are fundamentally limited by the base model's capability. Llama 70B barely improves with either approach, while R1 improves dramatically. The feedback helps, but only if the model can use it.

What is the biggest improvement from iterative refinement, and what type of feedback drives it?

Chapter 7: Analysis

Let's look at what LLMs actually do when they succeed — and why they fail when they fail.

What LLMs get right: successful optimizations

The paper identifies several categories of genuine optimizations that LLMs discover:

Operator fusion. The most common success pattern. LLMs fuse GELU computation into a single kernel (2.9x speedup), fuse Softsign (1.3x), and fuse matmul + division + summation + scaling (2.6x). Fusion reduces the number of round-trips to slow global memory — instead of writing intermediate results and reading them back, everything happens in fast registers.

Memory hierarchy exploitation. Some kernels use shared memory effectively: cosine similarity (2.8x) and triplet margin loss (2.0x). Shared memory is on-chip and much faster than global memory, but it is small: by default a thread block can use at most 48 KB. Using it correctly requires careful tiling.
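
The size limit translates directly into a limit on tile dimensions. As a rough sketch, assuming a 48 KB shared-memory budget, two square tiles resident at once (one from each input matrix), and tile sides restricted to multiples of 16, the largest tile that fits can be computed as follows. All three assumptions are illustrative.

```python
def max_square_tile(budget_bytes=48 * 1024, bytes_per_elem=4, tiles=2, multiple=16):
    """Largest tile side (a multiple of `multiple`) such that `tiles`
    square tiles of that side fit in the shared-memory budget.
    Returns 0 if even the smallest tile does not fit."""
    side = 0
    while tiles * (side + multiple) ** 2 * bytes_per_elem <= budget_bytes:
        side += multiple
    return side

print("float32:", max_square_tile(bytes_per_elem=4))
print("float16:", max_square_tile(bytes_per_elem=2))
```

Halving the element size allows a larger tile, which is one reason the best tile shape depends on both the data type and the specific GPU.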

Algorithmic insight. The most impressive case: for a diagonal matrix times a dense matrix, one model recognized that it can simply scale each row of the dense matrix by the corresponding diagonal element instead of performing a full matmul. This yields a 13x speedup by reducing O(n³) to O(n²).

The 13x diagonal matrix optimization is genuinely clever. A diagonal matrix is mostly zeros, so a full matmul wastes enormous compute multiplying by zero. The LLM recognized this structure and replaced the operation with a simple row-scaling loop. This is the kind of algorithmic insight that separates expert kernel engineers from average ones.
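
The equivalence behind this optimization is easy to check on a small example. The sketch below assumes the diagonal matrix is the left operand, diag(d) @ B, which is the case where row scaling applies; it is plain Python for illustration and is not the kernel the model generated.

```python
def diag_matmul_full(d, B):
    """diag(d) @ B via a full matmul; also returns the scalar multiply count."""
    n, m = len(d), len(B[0])
    D = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    C = [[sum(D[i][k] * B[k][j] for k in range(n)) for j in range(m)]
         for i in range(n)]
    return C, n * n * m

def diag_matmul_scaled(d, B):
    """Same result by scaling row i of B by d[i]."""
    C = [[d[i] * x for x in row] for i, row in enumerate(B)]
    return C, len(d) * len(B[0])

d = [2.0, 3.0, 4.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(diag_matmul_full(d, B))
print(diag_matmul_scaled(d, B))
```

Both functions return the same matrix, but the full version spends n times as many multiplies, almost all of them by zero. For large n that factor is where the asymptotic saving comes from.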

What LLMs get wrong: failure patterns

Execution failures dominate. For standard models (GPT-4o, Claude, Llama), over 70% of generated kernels have execution errors. Even reasoning models fail on >45%. The most common: CUDA syntax errors, incorrect thread/block dimension calculations, and memory access violations.

Correctness is the hard ceiling. After 10 turns of refinement, R1 can compile-and-run >90% of kernels. But functional correctness remains stuck — many kernels produce numerically wrong results. The error messages for correctness failures ("output mismatch") are much less actionable than compile errors.

Tensor core instructions are a black hole. When provided with hardware information, R1 attempts to use warp matrix multiply-accumulate (wmma) instructions for ~50% of matrix multiply tasks. Almost none compile successfully. Tensor cores are critical for real-world performance but their programming model is extremely specific — wrong alignment, wrong data types, or wrong tile sizes cause silent failures.

Hardware generalization is poor. Kernels that beat PyTorch on L40S may not beat it on A10G or H100. R1's L2 fast_1 varies from 36% (L40S) to 47% (A10G): the same kernels perform differently on different hardware. Optimal kernels are hardware-specific.

What is the most impressive algorithmic optimization an LLM discovered in KernelBench?

Chapter 8: The Optimization Gap

Let's put the KernelBench results in context. The gap between what LLMs produce and what expert kernel engineers achieve is enormous — and understanding why illuminates what LLMs need to improve.

What human experts do that LLMs don't

Tensor core utilization. FlashAttention achieves 50-75% of peak GPU throughput by carefully orchestrating tensor core matrix-multiply instructions with precise memory layouts. LLMs attempt wmma but almost never get the alignment, tile sizes, and data flow right. Tensor cores are the single biggest lever for performance on modern GPUs, and in these experiments LLMs almost never use them successfully.

Asynchronous execution. Expert kernels overlap computation with memory transfers using CUDA streams and asynchronous copies. This hides memory latency — while one warp computes on data in shared memory, another warp is loading the next tile from global memory. No LLM-generated kernel in the paper uses this technique.

Hardware-aware tiling. The optimal tile size depends on the specific GPU: how much shared memory per SM, how many registers per thread, the warp size. Expert kernels are tuned for the target hardware. When LLMs were given hardware specs (H100 memory sizes, bandwidths, TFLOPS), they rarely produced hardware-specific code. A few o1 and R1 kernels used hardware info to generate 2x+ speedups, but these were outliers.

Recomputation strategies. Sometimes it's faster to recompute an intermediate value than to store and reload it from memory. FlashAttention does this with the softmax normalization factor. This requires understanding the compute-vs-memory tradeoff specific to the hardware — a level of reasoning LLMs haven't demonstrated.

The data scarcity problem. CUDA code is only 0.073% of The Stack v1.2, a popular open-source code training corpus. Compare this to Python (~15%) or JavaScript (~20%). LLMs have seen millions of Python functions but only thousands of CUDA kernels. And the highest-quality kernels — the ones in cuBLAS and cuDNN — are closed-source, so LLMs can't learn from them at all.

In-context demonstrations: a partial fix

When given three carefully chosen examples (GeLU fusion, tiled matmul, minimal FlashAttention), LLMs attempt more aggressive optimizations. OpenAI o1's generations are 25% longer on average. On 77% of GEMM variants, o1 applies tiling (though still slower than PyTorch due to missing tensor core utilization). On 11 Level 2 problems, o1 uses shared memory I/O management to beat PyTorch Eager.

But there's a catch: more ambitious code means more bugs. Overall fast_1 actually decreases with few-shot examples because the LLMs attempt optimizations they can't implement correctly. Among correct solutions, quality improves, but the error rate goes up.

The path forward

The paper identifies three directions:

  1. More CUDA training data. Fine-tuning on high-quality kernel code (open-sourcing FlashAttention-style implementations) would directly address the data scarcity.
  2. Higher-level abstractions. Instead of generating raw CUDA, LLMs could target Triton, CUTLASS, or ThunderKittens — higher-level APIs that handle tensor core scheduling and memory management. This reduces the kernel-writing problem to a pattern the LLM is more likely to get right.
  3. Agentic workflows. The iterative refinement results (36% → 72% on L2) suggest that LLMs benefit enormously from feedback loops. More sophisticated agents that can profile, hypothesize, and refine could push the frontier further.

Why do in-context examples of optimization techniques (tiling, fusion) decrease overall fast_1 despite improving the quality of correct solutions?

Chapter 9: Connections

KernelBench sits at the intersection of two rapidly evolving fields: LLM code generation and GPU kernel optimization. Here's how it connects to the broader landscape.

Related benchmarks

| Benchmark | Focus | KernelBench Difference |
|-----------|-------|------------------------|
| HumanEval / MBPP | Algorithmic correctness | KB tests wall-clock performance, not just correctness |
| SWE-bench | Real GitHub issues | KB focuses on performance optimization, not bug fixing |
| RE-Bench | R&D capabilities | KB is domain-specific to GPU kernels |
| ECCO | Algorithmic efficiency | KB measures wall-clock time, not asymptotic complexity |

The evolving landscape (post-publication)

AlphaEvolve (DeepMind, 2025) uses LLMs in an evolutionary loop to discover novel algorithms. Applied to GPU kernels, this approach could combine the search capabilities of evolution with LLMs' ability to propose plausible code modifications.

CodeMonkeys and similar agentic systems give LLMs the ability to iteratively compile, test, and profile code in a sandbox. KernelBench's iterative refinement results (36% → 72%) suggest this direction is promising.

RLEF (Reinforcement Learning from Execution Feedback) trains models to use execution results as reward signals. KernelBench's automated evaluation pipeline is a natural fit for this approach — correctness and speedup can be directly used as reward.

Triton and higher-level DSLs are becoming more popular for kernel writing. Future KernelBench evaluations could allow LLMs to target Triton instead of raw CUDA, potentially making the benchmark more tractable while maintaining real-world relevance.

A self-improving benchmark. Unlike many benchmarks that saturate and become obsolete, KernelBench is designed to remain relevant. The fast_p threshold can be increased as models improve. The PyTorch baseline can be updated to include torch.compile. New workloads can be added as new architectures emerge. And because every task maps to a real ML workload, progress on KernelBench directly means faster real-world training.

The bigger picture

KernelBench asks a question that goes beyond benchmarks: can LLMs become effective tools for systems-level programming where correctness and performance both matter? The current answer is "not yet, but the trajectory is promising." Iterative refinement at least doubles R1's one-shot fast_1 at every level (12% to 43% on L1, 36% to 72% on L2, 2% to 18% on L3). Reasoning models significantly outperform standard ones. And the few successful optimizations (13x diagonal matrix, 2.9x GELU fusion) show that LLMs can discover genuine algorithmic insights.

The bottleneck isn't intelligence — it's experience. LLMs haven't seen enough CUDA, haven't practiced enough kernel debugging, and don't have enough exposure to the optimization patterns that human experts internalize over years. Closing that gap — through better training data, better abstractions, and better feedback loops — is the path to LLMs that can actually write efficient GPU kernels.

What makes KernelBench resistant to benchmark saturation compared to most coding benchmarks?