KernelBench — Veanors

Chapter 0: The Problem

You train a transformer model. Under the hood, PyTorch dispatches hundreds of GPU operations — matrix multiplies, softmax, layer norm, GELU activations. Each one is implemented as a kernel: a function that runs in parallel across thousands of GPU threads.

The quality of these kernels determines whether your training run takes 3 days or 3 weeks. A well-written kernel can be 10x faster than a naive one for the same mathematical operation. The difference comes down to how cleverly you use the GPU's memory hierarchy, parallelism, and specialized hardware like tensor cores.

Here's the problem: writing good GPU kernels is brutally hard. It requires deep knowledge of CUDA, GPU architecture, memory access patterns, thread scheduling, and hardware-specific instructions. FlashAttention, the attention kernel that made modern transformers practical, took 5 years to appear after the Transformer was published. Porting it to the next GPU generation (NVIDIA Hopper) took another 2 years.

The bottleneck is human expertise. There's a Cambrian explosion of new ML architectures — Mamba, RWKV, mixture-of-experts — but each one needs custom kernels to run efficiently. The kernel engineers can't keep up. What if language models could write these kernels automatically?

This is the question KernelBench asks. Not "can LLMs write CUDA code that compiles?" but something much harder: can LLMs write CUDA code that is both correct AND faster than what PyTorch already provides?

PyTorch's built-in kernels aren't toy implementations. They call into cuBLAS, cuDNN, and other heavily optimized closed-source libraries that NVIDIA has been refining for years. Beating them means the LLM must discover genuine algorithmic improvements — operator fusion, memory tiling, exploiting sparsity — not just produce compilable code.

Why is writing efficient GPU kernels so challenging that it motivates using LLMs?

It requires deep knowledge of CUDA, memory hierarchies, and hardware-specific instructions, and the demand for new kernels far outpaces the supply of expert kernel engineers
GPUs are too slow for modern ML workloads
PyTorch doesn't support GPU acceleration

Chapter 1: The Key Insight

KernelBench is an open-source benchmark with 250 carefully selected PyTorch ML workloads. The design mirrors how a real kernel engineer works:

Input: A PyTorch Model class with __init__ and forward() methods, plus functions specifying input tensor shapes and dtypes.
Output: A new ModelNew class that produces identical results but runs faster, using any optimization strategy the LLM chooses — CUDA, Triton, CUTLASS, or any other approach.
Evaluation: Automated correctness checking (compare outputs on random inputs) and wall-clock performance measurement.

Here's what a typical task looks like. The LLM receives a PyTorch reference and must produce an optimized replacement:

python — Input: PyTorch Reference
import torch
import torch.nn as nn

class Model(nn.Module):
    # Simple matrix multiply: C = A @ B
    def forward(self, A, B):
        return torch.matmul(A, B)

M, N, K = 256, 256, 131072
def get_inputs():
    # A is (M, K) and B is (K, N), so C is (M, N)
    return [torch.randn(M, K), torch.randn(K, N)]

python + CUDA — Output: LLM-Generated Optimized Kernel
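import torch.nn as nn
from torch.utils.cpp_extension import load_inline

matmul_source = """
#include <torch/extension.h>
__global__ void matmul_kernel(const float* A,
    const float* B, float* C, int M, int N, int K) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < M && col < N) {
    float sum = 0.0f;
    for (int k = 0; k < K; ++k)
      sum += A[row*K + k] * B[k*N + col];
    C[row*N + col] = sum;
  }
}
// Host wrapper: one thread per output element (assumes contiguous float32 CUDA tensors).
torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B) {
  int M = A.size(0), K = A.size(1), N = B.size(1);
  auto C = torch::empty({M, N}, A.options());
  dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
  matmul_kernel<<<grid, block>>>(A.data_ptr<float>(),
      B.data_ptr<float>(), C.data_ptr<float>(), M, N, K);
  return C;
}
"""
matmul = load_inline(name="matmul", cuda_sources=matmul_source,
    cpp_sources="torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B);",
    functions=["matmul_cuda"])
class ModelNew(nn.Module):
    def forward(self, A, B):
        return matmul.matmul_cuda(A, B)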

The naive kernel above is actually slower than PyTorch. It doesn't use tiling, shared memory, or tensor cores. A real win requires the LLM to discover these optimizations on its own — that's what makes KernelBench hard. The LLM isn't just translating PyTorch to CUDA; it must engineer a faster implementation.
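To see why tiling matters for this task, a rough count of global-memory loads helps. The sketch below assumes the textbook shared-memory scheme with square T x T tiles, in which each value loaded from global memory is reused T times. The function name and the tile size of 16 are illustrative choices, not values taken from KernelBench.

```python
def global_loads(M, N, K, tile=1):
    """Approximate global-memory loads for C = A @ B.

    tile=1 models the naive kernel: every output element reads a full
    row of A and a full column of B (2 * K loads each).
    tile=T models a T x T shared-memory tiling, where each value loaded
    from global memory is reused by T threads.
    """
    return 2 * M * N * K // tile

M, N, K = 256, 256, 131072  # shapes from the task above
naive = global_loads(M, N, K)
tiled = global_loads(M, N, K, tile=16)
print(naive, tiled, naive // tiled)
```

Under these assumptions the tiled kernel issues 16 times fewer global loads, which is why an untiled kernel is usually memory-bound long before it is compute-bound.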

The benchmark is designed so that progress directly translates to real-world impact. Every task corresponds to an actual ML workload. If an LLM can beat the PyTorch baseline on a convolution kernel, that kernel can immediately be used in production training runs.

What does the LLM receive as input in a KernelBench task?

A PyTorch Model class with forward() method and input tensor specifications; the LLM must produce a ModelNew class that is both correct and faster
Raw CUDA code to debug
A natural language description of the desired kernel

Chapter 2: The fast_p Metric

How do you score an LLM on kernel writing? You need to capture two things simultaneously: is the kernel correct, and is it fast? A kernel that produces wrong answers is useless no matter how fast it is. A kernel that's correct but slower than PyTorch is also useless.

KernelBench introduces the fast_p metric. It measures the fraction of tasks where the LLM-generated kernel is both functionally correct and achieves a speedup greater than threshold p over the PyTorch baseline:

$$\text{fast}_p = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(\text{correct}_i \wedge \text{speedup}_i > p\right)$$

Where speedup_i = T_Model / T_ModelNew — the ratio of PyTorch baseline time to generated kernel time.

The threshold p is the key knob:

fast_0: Any correct kernel counts, regardless of speed. This measures raw correctness rate.
fast_1.0: The kernel must be correct AND faster than PyTorch (speedup > 1). This is the primary metric.
fast_2.0: Correct AND more than 2x faster than PyTorch. Much harder.
fast_5.0: Correct AND more than 5x faster. Requires genuinely clever optimizations.
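The metric is simple enough to compute in a few lines. The following is a minimal sketch, assuming each task result is reduced to a (correct, speedup) pair; the function name and the four sample results are made up for illustration and are not KernelBench data.

```python
def fast_p(results, p):
    """Fraction of tasks that are correct AND have speedup > p.

    results: list of (correct: bool, speedup: float) pairs, one per task,
    where speedup = baseline_time / generated_kernel_time.
    """
    if not results:
        raise ValueError("need at least one task result")
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

# Hypothetical results for four tasks (made-up numbers, for illustration only).
results = [(True, 2.5), (True, 0.8), (False, 3.0), (True, 1.2)]
print(fast_p(results, 0))    # 0.75: three of four kernels are correct
print(fast_p(results, 1.0))  # 0.5: two are correct and faster than baseline
print(fast_p(results, 2.0))  # 0.25: one is correct and more than 2x faster
```

Note that the third task has the highest speedup but never counts, because an incorrect kernel fails the conjunction at every threshold.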

Why both axes matter: In the paper's experiments, many models produce kernels that are "almost correct" — they compile and run but produce slightly wrong numerical outputs. Other kernels are perfectly correct but 10x slower because the LLM wrote a naive loop instead of using tiling. fast_p captures this by requiring BOTH correctness and performance in a single number.

The correctness check compares outputs on 5 randomly generated inputs. The performance check runs 100 timed iterations after 3 warmup runs, using CUDA events for precise GPU timing. The coefficient of variation is consistently below 3%, so the mean is a reliable measure.
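The coefficient of variation is the standard deviation divided by the mean. As a small sketch of how the timing statistics could be summarized (the helper name and the sample timings are hypothetical, and this is not the benchmark's own code):

```python
import statistics

def summarize_timings(times_ms):
    """Return mean, std, min, max and coefficient of variation (std / mean)."""
    mean = statistics.fmean(times_ms)
    std = statistics.stdev(times_ms)  # sample standard deviation
    return {"mean": mean, "std": std, "min": min(times_ms),
            "max": max(times_ms), "cv": std / mean}

# Made-up timings in milliseconds, standing in for the 100 timed runs.
times_ms = [10.0, 10.2, 9.9, 10.1, 9.8]
stats = summarize_timings(times_ms)
print(f"mean = {stats['mean']:.2f} ms, cv = {stats['cv']:.1%}")
```

A low coefficient of variation means individual runs cluster tightly around the mean, so comparing two kernels by mean time is meaningful.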

What does fast_1.0 = 12% mean for a given model?

12% of the 250 tasks produced a kernel that was both functionally correct AND faster than PyTorch's built-in implementation
12% of kernels compiled successfully
The model achieved a 12% speedup on average

Chapter 3: Workload Design

The 250 tasks are organized into three difficulty levels based on the number of primitive operations they contain. Each level tests a different aspect of kernel engineering:

Level 1 — Single Operations (100 tasks)

Individual building blocks of AI: matrix multiplies, convolutions, activations (GELU, ReLU, Softmax), normalizations (LayerNorm, BatchNorm), and losses (CrossEntropy, MSE). These are the atoms of ML computation.

The catch: PyTorch already calls heavily optimized (often closed-source) kernels like cuBLAS for these. Beating the baseline means the LLM must match years of hand-tuned NVIDIA engineering. But if it succeeds, the resulting open-source kernels are immediately valuable.

Level 2 — Operator Sequences (100 tasks)

Chains of 3-6 operations that commonly appear together: a convolution followed by ReLU and bias, or a matmul followed by softmax and dropout. The key opportunity here is operator fusion — combining multiple operations into a single kernel to avoid reading from and writing to slow global memory between each step.

Compiler tools like torch.compile already do some fusion automatically. The question is whether LLMs can discover more aggressive fusions than the compiler.
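The memory-traffic argument for fusion can be illustrated without a GPU. The sketch below is a plain-Python analogy, assuming a hypothetical three-step chain (bias add, ReLU, scale) and counting one unit of "traffic" per element read from or written to an array; on a real GPU the arrays would live in global memory and the fused intermediates in registers.

```python
def run_unfused(xs, bias, scale):
    """Three separate passes; each one reads an array and writes a new one."""
    traffic = 0  # elements moved to or from "global memory"
    added = [x + bias for x in xs]
    traffic += 2 * len(xs)
    activated = [max(v, 0.0) for v in added]  # ReLU
    traffic += 2 * len(xs)
    scaled = [v * scale for v in activated]
    traffic += 2 * len(xs)
    return scaled, traffic

def run_fused(xs, bias, scale):
    """One pass; intermediates stay in local variables (registers on a GPU)."""
    out = [max(x + bias, 0.0) * scale for x in xs]
    return out, 2 * len(xs)  # one read and one write per element

xs = [-1.0, 0.5, 2.0]
out_a, traffic_a = run_unfused(xs, bias=0.25, scale=2.0)
out_b, traffic_b = run_fused(xs, bias=0.25, scale=2.0)
print(out_a == out_b, traffic_a, traffic_b)  # True 18 6
```

The results are identical, but the fused version moves a third as much data, and for memory-bound elementwise chains that ratio roughly bounds the achievable speedup.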

Level 3 — Full Architectures (50 tasks)

Complete ML models from popular GitHub repositories: AlexNet, MiniGPT, ResNet variants, transformer blocks. These require the LLM to identify which operations to optimize and how to compose optimizations across an entire architecture. This is the closest to real-world kernel engineering.

Why these specific tasks? Every task maps to real ML workloads. The Level 1 tasks cover the operators that consume >90% of GPU time in typical training runs. Level 2 captures the most common fusion patterns. Level 3 includes architectures people actually deploy. Success on KernelBench means faster real training runs, not just benchmark points.

Level	Tasks	Operations per Task	What It Tests	Baseline Challenge
L1	100	1	Single-op kernel writing	cuBLAS, cuDNN
L2	100	3-6	Operator fusion	torch.compile
L3	50	Many	Full architecture optimization	Both + algorithmic

Why is Level 2 (operator sequences) a particularly interesting test for LLMs?

Because the main optimization opportunity is operator fusion (combining multiple operations into one kernel to reduce memory I/O), which requires the LLM to reason about memory access patterns, not just translate syntax
Because Level 2 tasks are the easiest
Because Level 2 tasks don't require CUDA knowledge

Chapter 4: Evaluation Framework

Here's the complete pipeline that every LLM-generated kernel goes through. This is the machinery that turns "can an LLM write a kernel?" into a precise, repeatable measurement.

1. PyTorch Model + Task Spec

Reference Model class with forward(), plus get_inputs() and get_init_inputs() specifying tensor shapes and types

↓

2. LLM Generation

LLM receives Model + one-shot example, outputs ModelNew with custom CUDA/Triton code via greedy decoding

↓

3. Compilation

PyTorch's load_inline compiles the inline CUDA source with nvcc; compiler errors are caught here

↓

4. Correctness Check

Run Model and ModelNew on 5 random inputs, compare output tensors for equivalence

↓

5. Performance Benchmark

3 warmup iterations, then 100 timed runs using CUDA events. Record mean, std, min, max

↓

6. Compute fast_p

speedup = T_Model / T_ModelNew. If correct AND speedup > p → count as success

The evaluation runs on a bare-metal NVIDIA L40S GPU (Ada Lovelace, 48 GB GDDR6, 300W) with Python 3.10, PyTorch 2.5.0, and CUDA 12.4. Only one kernel runs at a time, with no interference from other CUDA processes.

Why bare metal matters: GPU timing is notoriously noisy in cloud environments where multiple tenants share hardware. Running on bare metal with exclusive GPU access ensures that timing measurements reflect actual kernel performance, not scheduling noise. The <3% coefficient of variation confirms this setup produces reliable measurements.

There are three categories of failure, and they cascade:

Failure Category 1: Execution Errors

The kernel doesn't even run. This includes CUDA/nvcc compile-time errors (syntax mistakes, undefined functions), CUDA memory violations (out-of-bounds access, unaligned memory), and Python runtime errors. Reasoning models like o1 and DeepSeek-R1 produce significantly fewer of these.

Failure Category 2: Correctness Errors

The kernel runs but produces wrong outputs. This includes shape mismatches (output tensor has wrong dimensions) and value mismatches (numerically incorrect results). These are harder to fix because the error messages are less informative — "your output differs from reference" tells you less than "undefined variable on line 42."

Failure Category 3: Performance Failures

The kernel is correct but slower than PyTorch. This is the most common outcome even for correct kernels — the LLM produces a valid implementation that misses the optimizations that make PyTorch's built-in kernels fast.
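The cascade can be written as a sequence of early returns: a kernel is only checked for correctness if it ran, and only timed if it was correct. This is a minimal sketch; the function name and the label strings are hypothetical, not identifiers from the KernelBench codebase.

```python
def classify(ran, correct, speedup, p=1.0):
    """Return the first stage at which a generated kernel fails.

    ran:     compiled and executed without error
    correct: outputs matched the reference on every test input
    speedup: baseline_time / kernel_time (only meaningful if correct)
    """
    if not ran:
        return "execution_error"
    if not correct:
        return "correctness_error"
    if speedup <= p:
        return "performance_failure"
    return "success"

print(classify(ran=False, correct=False, speedup=0.0))  # execution_error
print(classify(ran=True, correct=False, speedup=0.0))   # correctness_error
print(classify(ran=True, correct=True, speedup=0.7))    # performance_failure
print(classify(ran=True, correct=True, speedup=1.4))    # success
```

Only the last outcome counts toward fast_p at threshold p.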

In the KernelBench evaluation pipeline, what are the three cascading failure categories?

Execution errors (won't compile/run), correctness errors (wrong output), and performance failures (correct but slower than PyTorch)
Syntax errors, logic errors, and timeout errors
GPU memory errors, CPU errors, and network errors

Chapter 5: Model Results

The results below cover seven frontier and open-source models in a one-shot setting: the LLM sees one example of a PyTorch-to-CUDA transformation (a simple add operator) and then must generate optimized kernels for each of the 250 tasks.

The headline finding is stark: averaged over all 250 tasks, even the best models beat PyTorch's baseline on fewer than 20% of them, although individual levels vary widely.

Model	L1 fast₁	L2 fast₁	L3 fast₁	Type
GPT-4o	4%	5%	0%	Standard
OpenAI o1	10%	24%	12%	Reasoning
DeepSeek V3	6%	4%	8%	Standard
DeepSeek R1	12%	36%	2%	Reasoning
Claude 3.5 Sonnet	10%	7%	2%	Standard
Llama 3.1 70B	3%	0%	0%	Open-source
Llama 3.1 405B	3%	0%	2%	Open-source
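The per-level scores can be folded into a single overall figure by weighting each level by its task count (100, 100, and 50). The calculation below is an illustrative sketch based on the rounded percentages in the table, so the results are approximate and are not numbers quoted from the paper.

```python
TASKS = {"L1": 100, "L2": 100, "L3": 50}

def overall(per_level_pct):
    """Task-weighted average of per-level fast_1 percentages."""
    solved = sum(per_level_pct[lvl] * n / 100 for lvl, n in TASKS.items())
    return 100 * solved / sum(TASKS.values())

r1 = overall({"L1": 12, "L2": 36, "L3": 2})
o1 = overall({"L1": 10, "L2": 24, "L3": 12})
print(round(r1, 1), round(o1, 1))  # 19.6 16.0
```

On this weighting, both reasoning models land below 20% overall even though R1 reaches 36% on Level 2 alone.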

Key observations

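Reasoning models lead. DeepSeek R1 and OpenAI o1 post the top Level 2 scores by a wide margin (36% and 24%, against at most 7% for any standard model), and R1 has the top Level 1 score. The advantage is not uniform: Claude 3.5 Sonnet ties o1 on Level 1, and DeepSeek V3 beats R1 on Level 3 (8% vs. 2%). Still, the extended "thinking" these models do appears to help with the structured, multi-step reasoning that kernel optimization requires.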

Level 2 is the sweet spot. R1 gets 36% on L2 but only 12% on L1 and 2% on L3. Level 2 tasks involve operator sequences where fusion is the natural optimization — and fusion is a well-documented pattern that LLMs have likely seen in training data. Level 1 tasks require beating cuBLAS (very hard). Level 3 tasks require holistic architecture optimization (too complex).

The Llama models struggle badly. Llama 3.1 70B scores 0% on both L2 and L3. Even the 405B variant barely improves. CUDA is only 0.073% of The Stack v1.2 training corpus, so models trained mostly on public code have seen very little kernel code.

The sobering reality: Most LLM-generated kernels are not just slower — they're wrong. Looking at fast₀ (correctness only, ignoring speed), even the best models produce correct kernels for fewer than 50% of tasks. The majority of failures are execution errors: the code doesn't even compile. Writing CUDA that compiles, runs correctly, AND beats PyTorch is a very high bar.

Which model achieves the highest fast₁ on Level 2 tasks, and why might Level 2 be its strongest category?

DeepSeek R1 at 36%: Level 2 tasks involve operator sequences where fusion is the natural optimization, a well-documented pattern the reasoning model can discover through extended chain-of-thought
GPT-4o at 24%: it has the most CUDA training data
Llama 405B at 12%: because it's the largest model

Chapter 6: Iterative Refinement

A human kernel engineer doesn't write a perfect kernel on the first try. They compile, check for errors, profile, and iterate. KernelBench's framework enables the same workflow for LLMs by feeding back three types of signals:

G (Generation): The LLM's previous code attempt.
E (Execution feedback): Compiler errors, CUDA memory violations, correctness check results, and wall-clock timing.
P (Profiler output): PyTorch profiler data showing per-operator timing breakdowns.

The paper compares two test-time strategies with a fixed budget of 10 inference calls:

Repeated Sampling

Generate 10 independent kernel attempts with high temperature. Pick the best one (correct and fastest). This is a brute-force search — no learning between attempts.

Result: with the sample count raised to 100, DeepSeek V3 goes from 4% to 37% on Level 2. But for tasks where the model has near-zero probability of generating a correct solution (like 34 convolution variants in L1), even 100 attempts change nothing.

Iterative Refinement

Generate one kernel, feed back execution results and profiler data, let the LLM refine. Repeat for 10 turns. Each turn builds on the previous attempt's feedback.

Result: DeepSeek R1 with G+E+P feedback goes from 36% to 72% on Level 2 — doubling its one-shot performance. On Level 1, it jumps from 12% to 43%.
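The structural difference between the two strategies is whether anything flows from one attempt to the next. The sketch below assumes hypothetical `generate` and `evaluate` callables standing in for the LLM call and the evaluation harness, and assumes the best correct attempt is the one reported; the toy stand-ins exist only so the example runs.

```python
def repeated_sampling(generate, evaluate, budget=10):
    """Draw `budget` independent attempts; keep the fastest correct one."""
    best = None
    for _ in range(budget):
        result = evaluate(generate(feedback=None))
        if result["correct"] and (best is None or result["speedup"] > best["speedup"]):
            best = result
    return best

def iterative_refinement(generate, evaluate, budget=10):
    """Feed each attempt and its evaluation back into the next call."""
    best, feedback = None, None
    for _ in range(budget):
        kernel = generate(feedback=feedback)
        result = evaluate(kernel)
        feedback = {"previous": kernel, "result": result}  # G and E signals
        if result["correct"] and (best is None or result["speedup"] > best["speedup"]):
            best = result
    return best

# Toy stand-ins: the "kernel" is just a number that improves with feedback.
def toy_generate(feedback):
    return 0.5 if feedback is None else feedback["previous"] + 0.2

def toy_evaluate(kernel):
    return {"correct": True, "speedup": kernel}

sampled = repeated_sampling(toy_generate, toy_evaluate, budget=3)
refined = iterative_refinement(toy_generate, toy_evaluate, budget=3)
print(sampled["speedup"] < refined["speedup"])  # True
```

The toy generator is deliberately rigged to benefit from feedback; a real model may or may not, which is why the two strategies have to be compared empirically.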

Method	R1 L1	R1 L2	R1 L3
One-shot baseline	12%	36%	2%
Iterative w/ G only	18%	44%	4%
Iterative w/ G+E	41%	62%	12%
Iterative w/ G+E+P	43%	72%	18%

Execution feedback is the game-changer. Adding E (execution results) on top of G jumps R1's L2 score from 44% to 62%. Adding P (profiler) on top of that adds another 10 points to 72%. The model can fix compilation errors and memory violations when it sees the error messages. But correctness errors are harder to fix — "your output differs" is much less actionable than "undefined variable on line 42." After 10 turns of refinement, R1 produces a functional kernel on >90% of L1 and L2 tasks, but most remaining failures are correctness issues, not execution errors.

Iterative refinement wins in 5 of 6 comparisons against repeated sampling at the same inference budget. But both methods are fundamentally limited by the base model's capability. Llama 70B barely improves with either approach, while R1 improves dramatically. The feedback helps, but only if the model can use it.

What is the biggest improvement from iterative refinement, and what type of feedback drives it?

DeepSeek R1 on Level 2 goes from 36% to 72%; execution feedback (E) is the biggest driver because it gives actionable error messages that the model can use to fix compilation and runtime failures
GPT-4o on Level 1 goes from 4% to 50%
Profiler feedback (P) alone drives all improvement

Chapter 7: Analysis

Let's look at what LLMs actually do when they succeed — and why they fail when they fail.

What LLMs get right: successful optimizations

The paper identifies several categories of genuine optimizations that LLMs discover:

Operator fusion. The most common success pattern. LLMs fuse GELU computation into a single kernel (2.9x speedup), fuse Softsign (1.3x), and fuse matmul + division + summation + scaling (2.6x). Fusion reduces the number of round-trips to slow global memory — instead of writing intermediate results and reading them back, everything happens in fast registers.

Memory hierarchy exploitation. Some kernels use shared memory effectively: cosine similarity (2.8x) and triplet margin loss (2.0x). Shared memory sits on-chip and is much faster than global memory, but it is small: by default a thread block can use at most 48 KB. Using it correctly requires careful tiling.

Algorithmic insight. The most impressive case: for a diagonal matrix times a dense matrix, one model recognized that it can simply scale each row of the dense matrix by the corresponding diagonal element instead of performing a full matmul. This yields a 13x speedup by reducing O(n³) to O(n²).

The 13x diagonal matrix optimization is genuinely clever. A diagonal matrix is mostly zeros, so a full matmul wastes enormous compute multiplying by zero. The LLM recognized this structure and replaced the operation with a simple row-scaling loop. This is the kind of algorithmic insight that separates expert kernel engineers from average ones.
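The identity behind this optimization is easy to check on a small case. The sketch below uses plain Python lists and places the diagonal matrix on the left, so that rows are scaled; the function names are illustrative and this is not the kernel the model produced.

```python
def diag_matmul_full(d, B):
    """diag(d) @ B via a full matrix multiply: n * n * m multiply-adds."""
    n, m = len(d), len(B[0])
    D = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return [[sum(D[i][k] * B[k][j] for k in range(n)) for j in range(m)]
            for i in range(n)]

def diag_matmul_scaled(d, B):
    """Same result by scaling row i of B by d[i]: only n * m multiplies."""
    return [[d[i] * x for x in row] for i, row in enumerate(B)]

d = [2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0]]
print(diag_matmul_full(d, B))    # [[2.0, 4.0], [9.0, 12.0]]
print(diag_matmul_scaled(d, B))  # [[2.0, 4.0], [9.0, 12.0]]
```

The full version spends almost all of its multiply-adds on entries of the diagonal matrix that are zero; the scaled version never materializes them.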

What LLMs get wrong: failure patterns

Execution failures dominate. For standard models (GPT-4o, Claude, Llama), over 70% of generated kernels have execution errors. Even reasoning models fail on >45%. The most common: CUDA syntax errors, incorrect thread/block dimension calculations, and memory access violations.

Correctness is the hard ceiling. After 10 turns of refinement, R1 can compile-and-run >90% of kernels. But functional correctness remains stuck — many kernels produce numerically wrong results. The error messages for correctness failures ("output mismatch") are much less actionable than compile errors.

Tensor core instructions are a black hole. When provided with hardware information, R1 attempts to use warp matrix multiply-accumulate (wmma) instructions for ~50% of matrix multiply tasks. Almost none compile successfully. Tensor cores are critical for real-world performance but their programming model is extremely specific — wrong alignment, wrong data types, or wrong tile sizes cause silent failures.

Hardware generalization is poor. Kernels that beat PyTorch on L40S may not beat it on A10G or H100. R1's L2 fast₁ varies from 36% (L40S) to 47% (A10G) — the same kernels perform differently on different hardware. Optimal kernels are hardware-specific.

What is the most impressive algorithmic optimization an LLM discovered in KernelBench?

Replacing a full diagonal × dense matrix multiply with row-wise scaling, reducing O(n³) to O(n²) for a 13x speedup
Using tensor cores for all matrix operations
Replacing PyTorch with TensorFlow

Chapter 8: The Optimization Gap

Let's put the KernelBench results in context. The gap between what LLMs produce and what expert kernel engineers achieve is enormous — and understanding why illuminates what LLMs need to improve.

What human experts do that LLMs don't

Tensor core utilization. FlashAttention achieves 50-75% of peak GPU throughput by carefully orchestrating tensor core matrix instructions with precise memory layouts. LLMs attempt the wmma interface to the same hardware but almost never get the alignment, tile sizes, and data flow right. Tensor cores are the single biggest lever for performance on modern GPUs, and LLMs can't yet use them reliably.

Asynchronous execution. Expert kernels overlap computation with memory transfers using CUDA streams and asynchronous copies. This hides memory latency — while one warp computes on data in shared memory, another warp is loading the next tile from global memory. No LLM-generated kernel in the paper uses this technique.

Hardware-aware tiling. The optimal tile size depends on the specific GPU: how much shared memory per SM, how many registers per thread, the warp size. Expert kernels are tuned for the target hardware. When LLMs were given hardware specs (H100 memory sizes, bandwidths, TFLOPS), they rarely produced hardware-specific code. A few o1 and R1 kernels used hardware info to generate 2x+ speedups, but these were outliers.

Recomputation strategies. Sometimes it's faster to recompute an intermediate value than to store and reload it from memory. FlashAttention does this with the softmax normalization factor. This requires understanding the compute-vs-memory tradeoff specific to the hardware — a level of reasoning LLMs haven't demonstrated.

The data scarcity problem. CUDA code is only 0.073% of The Stack v1.2, a popular open-source code training corpus. Compare this to Python (~15%) or JavaScript (~20%). LLMs have seen millions of Python functions but only thousands of CUDA kernels. And the highest-quality kernels — the ones in cuBLAS and cuDNN — are closed-source, so LLMs can't learn from them at all.

In-context demonstrations: a partial fix

When given three carefully chosen examples (GeLU fusion, tiled matmul, minimal FlashAttention), LLMs attempt more aggressive optimizations. OpenAI o1's generations are 25% longer on average. On 77% of GEMM variants, o1 applies tiling (though still slower than PyTorch due to missing tensor core utilization). On 11 Level 2 problems, o1 uses shared memory I/O management to beat PyTorch Eager.

But there's a catch: more ambitious code means more bugs. Overall fast₁ actually decreases with few-shot examples because the LLMs attempt optimizations they can't implement correctly. Among correct solutions, quality improves. But the error rate goes up.

The path forward

The paper identifies three directions:

More CUDA training data. Fine-tuning on high-quality kernel code (open-sourcing FlashAttention-style implementations) would directly address the data scarcity.
Higher-level abstractions. Instead of generating raw CUDA, LLMs could target Triton, CUTLASS, or ThunderKittens — higher-level APIs that handle tensor core scheduling and memory management. This reduces the kernel-writing problem to a pattern the LLM is more likely to get right.
Agentic workflows. The iterative refinement results (36% → 72% on L2) suggest that LLMs benefit enormously from feedback loops. More sophisticated agents that can profile, hypothesize, and refine could push the frontier further.

Why do in-context examples of optimization techniques (tiling, fusion) decrease overall fast₁ despite improving the quality of correct solutions?

The LLMs attempt more aggressive optimizations they can't implement correctly, increasing execution failures; the correct solutions are better, but fewer solutions are correct
The examples confuse the LLMs
The examples are incorrect

Chapter 9: Connections

KernelBench sits at the intersection of two rapidly evolving fields: LLM code generation and GPU kernel optimization. Here's how it connects to the broader landscape.

Related benchmarks

Benchmark	Focus	KernelBench Difference
HumanEval / MBPP	Algorithmic correctness	KB tests wall-clock performance, not just correctness
SWE-bench	Real GitHub issues	KB focuses on performance optimization, not bug fixing
RE-Bench	R&D capabilities	KB is domain-specific to GPU kernels
ECCO	Algorithmic efficiency	KB measures wall-clock time, not asymptotic complexity

The evolving landscape (post-publication)

AlphaEvolve (DeepMind, 2025) uses LLMs in an evolutionary loop to discover novel algorithms. Applied to GPU kernels, this approach could combine the search capabilities of evolution with LLMs' ability to propose plausible code modifications.

CodeMonkeys and similar agentic systems give LLMs the ability to iteratively compile, test, and profile code in a sandbox. KernelBench's iterative refinement results (36% → 72%) suggest this direction is promising.

RLEF (Reinforcement Learning from Execution Feedback) trains models to use execution results as reward signals. KernelBench's automated evaluation pipeline is a natural fit for this approach — correctness and speedup can be directly used as reward.

Triton and higher-level DSLs are becoming more popular for kernel writing. Future KernelBench evaluations could allow LLMs to target Triton instead of raw CUDA, potentially making the benchmark more tractable while maintaining real-world relevance.

A self-improving benchmark. Unlike many benchmarks that saturate and become obsolete, KernelBench is designed to remain relevant. The fast_p threshold can be increased as models improve. The PyTorch baseline can be updated to include torch.compile. New workloads can be added as new architectures emerge. And because every task maps to a real ML workload, progress on KernelBench directly means faster real-world training.

The bigger picture

KernelBench asks a question that goes beyond benchmarks: can LLMs become effective tools for systems-level programming where correctness and performance both matter? The current answer is "not yet, but the trajectory is promising." With iterative refinement, R1's fast₁ at least doubles at every level (12% → 43%, 36% → 72%, 2% → 18%). Reasoning models significantly outperform standard ones on Levels 1 and 2. And the few successful optimizations (13x diagonal matrix, 2.9x GELU fusion) show that LLMs can discover genuine algorithmic insights.

The bottleneck isn't intelligence — it's experience. LLMs haven't seen enough CUDA, haven't practiced enough kernel debugging, and don't have enough exposure to the optimization patterns that human experts internalize over years. Closing that gap — through better training data, better abstractions, and better feedback loops — is the path to LLMs that can actually write efficient GPU kernels.

What makes KernelBench resistant to benchmark saturation compared to most coding benchmarks?

The adjustable fast_p threshold can be raised, the baseline can be updated (e.g., to torch.compile), and new workloads can be added as new ML architectures emerge
It has 250 tasks which is more than most benchmarks
It uses NVIDIA GPUs which are always getting faster

KernelBench: Can LLMs Write Efficient GPU Kernels?