Simple, Fast, and Adorable AI Kernels. A tile-level DSL that makes GPU register files a first-class programming abstraction — matching FlashAttention-3 on attention while delivering 8–14× speedups on emerging architectures like linear attention and state-space models.
You have a GPU with 989 TFLOPS of BF16 tensor-core throughput. That is nearly a petaflop of raw compute. And yet your attention kernel runs at 600 TFLOPS — barely 60% of what the hardware can do. Where does the other 40% go?
It goes to waiting. Waiting for data to arrive from memory. Waiting for bank conflicts to resolve. Waiting for loads to complete before the next multiply can begin. The arithmetic units are fast enough; the problem is feeding them.
This problem has a name: the memory wall. Modern GPUs are so compute-dense that the bottleneck is almost never the math — it is moving data from where it lives to where it needs to be. The NVIDIA H100 can perform 989 trillion BF16 operations per second, but its main memory (HBM) can only deliver 3 TB/s. A single BF16 matrix multiply of two 16×16 tiles requires 1024 bytes of input and produces 512 bytes of output. At 3 TB/s, that data transfer takes about 0.5 nanoseconds. But the tensor cores can finish the multiply in about 0.04 nanoseconds. The compute is 12× faster than the memory.
Every generation of GPU widens this gap. The A100 had 2 TB/s HBM bandwidth and 312 TFLOPS of BF16 tensor-core compute. The H100 has 3 TB/s bandwidth (1.5× more) but 989 TFLOPS (3.2× more compute). Bandwidth grows linearly; compute grows quadratically with transistor count. The wall gets taller every generation.
The solution is not to buy more bandwidth. The solution is to organize your computation so that data stays close to the compute units — in registers and shared memory — and moves through the memory hierarchy as few times as possible. This is what FlashAttention did for attention. ThunderKittens generalizes the approach to any kernel by making the register file a first-class programming abstraction.
Writing efficient GPU kernels today means choosing between three painful options:
| Tool | Abstraction Level | Performance | Pain |
|---|---|---|---|
| Raw CUDA | Thread-level | Optimal (if you're an expert) | Thousands of lines, months of tuning |
| CUTLASS / CuTe | Template meta-programming | Near-optimal | 22 MB of nested templates, steep learning curve |
| Triton | Block-level compiler | Good for simple kernels | Compiler opaque — can't control register placement |
ThunderKittens proposes a fourth option: program at the tile level. Instead of thinking about individual threads, warps, or compiler-generated code, you write operations on 16×16 matrix tiles that map directly to tensor-core instructions. The tiles live in registers. The operations are PyTorch-like. And the resulting code is 10–40 lines instead of hundreds.
| Library | Size |
|---|---|
| CuBLAS | 689 MB |
| CUTLASS | 22 MB |
| Triton | 12.6 MB |
| ThunderKittens | <1.0 MB |
Less than one megabyte. That is the entire ThunderKittens library — a header-only C++ template library that gives you register-level control with a clean, readable API. The authors report that undergraduates with no prior CUDA experience wrote competitive kernels in weeks.
The name "ThunderKittens" reflects the philosophy: these are small kernels (kittens, not lions) that are fast (thunder). The library trades generality for simplicity — it targets tile-based linear algebra workloads on NVIDIA Hopper GPUs, not arbitrary GPU programs. By narrowing the scope, it can make the common case trivial while still covering the vast majority of ML kernel needs.
The rest of this lesson will take you through the ThunderKittens design from the bottom up: the GPU memory hierarchy it exploits, the tile abstractions it provides, the three levels of parallelism it manages, and the concrete kernels that demonstrate its power.
To understand ThunderKittens, you need to understand the four levels of memory on a modern GPU. Each level trades capacity for bandwidth — the closer to the compute units, the faster but smaller the memory.
Here are the exact numbers for the NVIDIA H100 SXM:
| Level | Capacity | Bandwidth | Latency | Scope |
|---|---|---|---|---|
| HBM (High Bandwidth Memory) | 80 GB | 3 TB/s | ~400 cycles | Global — all SMs |
| L2 Cache | 50 MB | 12 TB/s | ~200 cycles | Global — all SMs |
| SRAM (Shared Memory) | 228 KB / SM | 33 TB/s | ~30 cycles | Per thread-block |
| Registers | 256 KB / SM | >100 TB/s | 1 cycle | Per thread |
Look at that bandwidth column. From HBM to registers, bandwidth increases by more than 30×. A byte sitting in a register is accessed in a single cycle with effectively infinite bandwidth. A byte in HBM takes 400 cycles to arrive and must share a 3 TB/s bus with every other SM on the chip.
Data flows from HBM through the cache hierarchy to registers. The bars show relative bandwidth at each level. Hover over each level to see details. Click "Animate" to watch a tile's journey from HBM to a tensor-core multiply.
Registers are the fastest memory on the chip, but they come with a hard constraint: each thread can use at most 255 registers on the H100. If your kernel needs more, the compiler spills registers to L1 cache — which is an order of magnitude slower. This is a cliff, not a slope. Exceeding 255 registers per thread means your carefully-tuned inner loop suddenly has cache-miss latency in every iteration.
ThunderKittens manages this constraint explicitly. It provides a warpgroup::increase_registers<232>() call for compute-heavy warps and warpgroup::decrease_registers<40>() for load-focused warps. By specializing warps, each one stays within its register budget while collectively the thread-block uses the full register file.
Shared memory (SRAM) sits between registers and HBM. It is programmer-managed — you explicitly load data into it and explicitly read from it. On the H100, each streaming multiprocessor has 228 KB of shared memory, organized into 32 physical banks.
Bank conflicts are the hidden killer. When two threads in the same warp access different addresses in the same bank, the accesses are serialized. A 32-way bank conflict means your shared memory load takes 32× longer than expected. Naive row-major layouts for 16×64 BF16 tiles cause 8-way bank conflicts. ThunderKittens uses swizzled layouts that XOR address bits to distribute accesses across banks, achieving zero bank conflicts.
The H100 has 132 streaming multiprocessors (SMs). Each SM has 4 physical execution units that can run simultaneously: a load/store unit, an ALU, an FMA unit, and a tensor core. This means four different types of instructions can execute in parallel within a single SM — but only if your code is structured to exploit this parallelism.
Each SM can host up to 64 software warps (2048 threads). But more warps means each warp gets fewer registers. This is the occupancy tradeoff: high occupancy hides latency (more warps to switch between while waiting for memory), but low occupancy gives each warp more registers to work with. ThunderKittens lets you control this tradeoff explicitly through its block-level template.
The conventional CUDA wisdom is "maximize occupancy." But ThunderKittens shows this is wrong for compute-bound kernels. Their experiments reveal an inverted-U relationship: performance increases with occupancy up to a point, then decreases as register pressure causes spills. The LCSF template expands the Pareto frontier — at any given occupancy level, a pipelined LCSF kernel outperforms a non-pipelined one because it hides latency through software pipelining rather than hardware warp switching.
The sweet spot for attention on H100 is 2–3 compute warpgroups per SM with 4 pipeline stages. This gives enough warps to keep the SM busy during barrier waits, while giving each compute warp enough registers (232) to hold all tile data without spills. Going to 4 warpgroups would reduce registers to ~170 per thread and force tile sizes down, losing more performance to smaller tiles than is gained from higher occupancy.
Let's make the memory wall concrete with a simple matrix multiply C = A × B, where A is M×K and B is K×N. The total arithmetic is 2·M·K·N FLOPs (one multiply and one add per output element, for each of K inner-product terms). The total data is (M·K + K·N + M·N) elements to read/write.
For a square matrix M = N = K = 4096 in BF16 (2 bytes each):
| Quantity | Value |
|---|---|
| FLOPs | 2 × 4096³ = 137 billion |
| Bytes (BF16) | 2 × 3 × 4096² = 100 MB |
| Arithmetic intensity | 137B / 100M = 1,365 FLOPs/byte |
| H100 compute/BW ratio | 989 TFLOPS / 3 TB/s = 330 FLOPs/byte |
The required intensity (1,365) exceeds the hardware's compute-to-bandwidth ratio (330), so this problem is compute-bound — the math is the bottleneck, not the memory. Good. But now consider the attention inner loop for sequence length 4096 and head dimension 64:
| Quantity | Value |
|---|---|
| QKT FLOPs per tile | 2 × 16 × 64 × 16 = 32K |
| Bytes to load K tile (16×64 BF16) | 2,048 |
| Arithmetic intensity (QKT only) | 32K / 2K = 16 FLOPs/byte |
16 FLOPs per byte. That is 20× below the H100's compute-to-bandwidth ratio. If you load K from HBM for every tile multiply, the tensor cores sit idle 95% of the time. The only way to make attention compute-bound is to load K into shared memory once and reuse it across many query rows — and to keep the attention scores and output accumulators in registers so they never touch HBM at all. This is exactly what ThunderKittens (and FlashAttention) do.
Here is the core design insight of ThunderKittens: the 16×16 matrix tile is the right abstraction for GPU programming.
Why 16×16? Because that is the native size of the H100's tensor-core instruction. When you call mma_AB(C, A, B, C) in ThunderKittens, it compiles down to a single HMMA instruction that multiplies a 16×16 tile of A by a 16×16 tile of B and accumulates into C. No loops, no thread-level indexing, no register shuffling. One tile operation = one hardware instruction.
ThunderKittens defines three types of tiles, one for each memory level:
rt_bf<16, 64> = 16 rows, 64 cols of BF16.st_bf<64, 64> = 64×64 BF16 tile with swizzled layout.gl<bf16, -1, -1, -1, 64> = dynamic batch/head/seq, static d=64.ThunderKittens tiles are C++ templates parameterized by type, height, and width. The naming convention encodes the data type directly:
| Declaration | Memory | Type | Shape |
|---|---|---|---|
rt_bf<16, 64> | Registers | BFloat16 | 16 × 64 |
rt_fl<16, 16> | Registers | Float32 | 16 × 16 |
st_bf<64, 64> | Shared | BFloat16 | 64 × 64 |
st_fl<64, 64> | Shared | Float32 | 64 × 64 |
Register tiles also carry a layout tag — either row-major or column-major. This is critical because tensor-core instructions require specific layouts: mma_AB needs A in row-major and B in column-major. ThunderKittens tracks layouts at compile time and raises errors if you try to multiply incompatible tiles. No silent correctness bugs from layout mismatches.
cuda using namespace kittens; // Register tiles — one per warp rt_bf<16, 64> q_reg; // query: 16 rows, 64 cols, BF16, row-major rt_bf<16, 64> k_reg; // key tile rt_fl<16, 16> att; // attention scores: float32 for precision rt_bf<16, 64> o_reg; // output accumulator // Shared tiles — visible to all warps in the block st_bf<64, 64> k_smem[2]; // double-buffered key tiles st_bf<64, 64> v_smem[2]; // double-buffered value tiles // Global layout descriptor — 4D: [batch, head, seq, dim] gl<bf16, -1, -1, -1, 64> globals_k; // last dim static
A critical detail: register tiles are owned by warps, not threads. A warp is 32 threads executing in lockstep. When you declare rt_bf<16, 64>, the 16×64 = 1,024 BF16 elements are distributed across the 32 threads of the warp — each thread holds 32 elements (64 bytes, using 16 of its 255 available 32-bit registers). The distribution pattern is dictated by the tensor-core instruction format, and ThunderKittens manages it automatically.
This is why layout matters at compile time. A row-major rt_bf<16, 64> distributes elements so that the tensor-core mma_AB instruction can read matrix A directly from the warp's registers. A column-major layout distributes elements for matrix B. Passing a row-major tile as B would mean the hardware reads the wrong elements — a silent correctness bug. ThunderKittens' compile-time layout tags prevent this entirely.
rt_bf<16, 64> tile costs 16 registers per thread (1024 BF16 elements / 32 threads = 32 BF16 elements per thread = 16 32-bit registers). An attention warp holding q_reg (16 regs), att (8 regs for 16×16 FP32), o_reg (16 regs), plus softmax state (max_vec, sum_vec, ~4 regs) uses about 44 registers just for tile data. Add loop counters, pointers, and temporaries, and you need ~60–80 registers per thread. With the H100's 255-register limit, you can fit the entire attention inner loop in registers — but barely. This is why register budget management (Chapter 5) matters.Real kernels need tiles bigger than 16×16. ThunderKittens handles this through composition: a rt_bf<16, 64> is internally four 16×16 tiles laid out contiguously in registers. A st_bf<64, 64> is sixteen 16×16 tiles in shared memory. Operations on larger tiles decompose into operations on the constituent 16×16 subtiles — the compiler unrolls this at compile time, producing exactly the right sequence of tensor-core instructions.
At the warpgroup level (4 warps working collectively), tiles can be even larger. A warpgroup can collectively own a 64×64 register tile, with each of its 4 warps holding a 16×64 slice. The warpgroup::mma_AB() operation coordinates all 4 warps to perform a 64×64 matrix multiply using their collective registers.
The composition is purely mechanical: a 64×64 MMA decomposes into 4×4 = 16 underlying 16×16 HMMAs, one per warp per subtile. The compiler unrolls this at compile time — at runtime, you get a flat sequence of 16 tensor-core instructions with no loop overhead. This means the cost of using larger tiles is exactly proportional to their area, with no additional overhead for coordination or indexing.
The global layout descriptor (gl) is the bridge between ThunderKittens' tile world and PyTorch's tensor world. You create a gl from a raw pointer and shape information, and TMA uses it to fetch 2D tiles from a 4D tensor. The last dimension is typically static (head dimension d=64 or d=128), while batch, head, and sequence dimensions are dynamic. This lets a single compiled kernel handle variable batch sizes and sequence lengths without recompilation.
ThunderKittens provides four categories of operations on tiles. The API is deliberately PyTorch-like — if you know PyTorch, you already know the verbs.
Move data between memory levels:
cuda // Register <-> Shared memory load(k_reg, k_smem[subtile]); // shared -> register store(k_smem[subtile], k_reg); // register -> shared // Shared memory <-> HBM (via TMA — Tensor Memory Accelerator) tma::load_async(k_smem, globals.k, // HBM -> shared (async) {batch, head, iter, 0}, barrier); tma::store_async(globals.o, o_smem); // shared -> HBM (async) // Copy between register tiles copy(att_bf16, att_fl32); // type conversion + copy
The tma::load_async calls use the H100's Tensor Memory Accelerator — a dedicated hardware unit that handles address generation and data movement without consuming any compute resources. While TMA loads data from HBM to shared memory, the tensor cores can continue multiplying tiles that were loaded in the previous iteration.
TMA is a Hopper-generation feature. On A100, async loads required the programmer to manage the address generation manually. On H100, TMA handles the entire load pipeline: computing the address from a 4D index (batch, head, sequence, dimension), performing bounds checking, and transferring data directly to shared memory — all without any thread involvement. This frees up warps to do useful compute instead of acting as data movers.
The tma::expect() call tells the barrier how many bytes to expect, so the waiting warps know when the load is complete. This producer-consumer pattern is the foundation of the LCSF template (Chapter 5).
The crown jewel. These map directly to tensor-core instructions:
cuda // C = A * B + C (accumulate into C) mma_AB(C, A, B, C); // A row-major, B column-major mma_ABt(C, A, B, C); // A row-major, B row-major (B transposed internally) // Warpgroup-level: 4 warps collectively multiply larger tiles warpgroup::mm_ABt(att, q_smem, k_smem); // 64x64 multiply warpgroup::mma_AB(o, att_bf16, v_smem); // 64x64 accumulate warpgroup::mma_async_wait(); // wait for completion
mma_AB requires A in row-major and B in column-major layout. If you pass tiles with wrong layouts, ThunderKittens raises a compile-time error. This catches bugs that in raw CUDA would silently produce wrong results and take days to debug.The swap_layout_inplace() function converts a tile between row-major and column-major without moving data — it just reinterprets how registers map to matrix elements. This is free at runtime but changes the compile-time layout tag:
cuda load(v_reg, v_smem[subtile]); // loads as row-major auto &v_col = swap_layout_inplace(v_reg); // now column-major (free!) mma_AB(o_reg, att_mma, v_col, o_reg); // compiles: B is column-major
Element-wise operations on register tiles, using the general-purpose ALU (not tensor cores):
cuda exp(att, att); // element-wise exponential multiply(scaled, att, 0.125); // scalar multiply zero(att); // fill with zeros max(result, a, b); // element-wise max
These are essential for softmax — you need row-wise max and row-wise sum:
cuda row_max(max_vec, att); // max of each row -> vector sub_row(att, att, max_vec); // subtract row max (for numerical stability) exp(att, att); // exponentiate row_sum(sum_vec, att); // sum of each row -> vector div_row(att, att, sum_vec); // normalize each row
These five lines are the entire softmax computation. In raw CUDA, softmax requires careful warp-shuffle reductions, shared-memory atomics, and multi-pass algorithms. In ThunderKittens, it is five function calls that read like pseudocode.
Under the hood, row_max and row_sum use warp-shuffle instructions to reduce across the 32 threads that collectively hold a row of the tile. Each thread holds a subset of elements; the shuffle reduction collects the partial results without going through shared memory. ThunderKittens generates these shuffles automatically from the tile layout — the user never sees them.
The sub_row and div_row operations broadcast a row vector (one value per row) across all columns of the tile. Again, the broadcast pattern depends on which thread holds which element, and ThunderKittens handles this automatically based on the tile's compile-time layout.
The four categories of tile operations and which hardware units execute them. Click a category to highlight its data path through the memory hierarchy.
swap_layout_inplace() actually do at runtime?The simplest ThunderKittens kernel operates at the warp level: one warp, operating on its own register tiles, calling tile operations. This is where you start. Let's build an attention inner loop from scratch.
Recall the attention formula: Attention(Q, K, V) = softmax(QKT / √d) · V. For a single query row q and a block of keys K and values V, the computation is:
In ThunderKittens, this entire inner loop — the hottest code in all of deep learning — is 11 lines:
cuda using namespace kittens; rt_bf<16, 64> k_reg, v_reg; // Load K tile from shared memory load(k_reg, k_smem[subtile]); // QK^T: compute attention scores zero(att); mma_ABt(att, q_reg, k_reg, att); // Softmax sub_row(att, att, max_vec); exp(att, att); div_row(att, att, norm_vec); // Convert to BF16 for the V multiply copy(att_mma, att); // Attend to values: O += att * V load(v_reg, v_smem[subtile]); auto &v_reg_col = swap_layout_inplace(v_reg); mma_AB(o_reg, att_mma, v_reg_col, o_reg);
Let's trace the data flow step by step:
Notice what doesn't happen: no data goes to HBM. No data even goes to shared memory during the core computation. Everything stays in registers. The only shared memory accesses are the initial loads of K and V, which can be pipelined with the previous iteration's compute.
Let's count exactly what the hardware does for this inner loop. For one iteration (one K tile, one V tile):
| Operation | Hardware Unit | Instructions |
|---|---|---|
load(k_reg, k_smem) | Load/Store unit | ~4 LDS instructions (loading 16×64 BF16 from SRAM) |
mma_ABt(att, q, k, att) | Tensor Core | 4 HMMA instructions (16×64 × 64×16 = four 16×16 multiplies) |
sub_row + exp + div_row | ALU / FMA | ~12 instructions (operating on 16×16 = 256 elements / 32 threads = 8 per thread) |
copy(att_mma, att) | Register rename | ~4 CVT instructions (FP32 to BF16) |
load(v_reg, v_smem) | Load/Store unit | ~4 LDS instructions |
mma_AB(o, att, v, o) | Tensor Core | 4 HMMA instructions |
The key observation: the load/store, ALU, and tensor-core instructions use different hardware units. They can execute in parallel. While the tensor core is doing the mma_ABt, the load/store unit can be prefetching the next V tile. While the ALU computes softmax, the tensor core might be finishing a multiply from the previous cycle. This instruction-level parallelism is what makes ThunderKittens' tile abstraction map so efficiently to hardware.
FlashAttention-2 implements the same tiling strategy — keeping Q in registers, streaming K and V through shared memory. But FA2 is written in raw CUDA with manual warp-level matrix operations, explicit register allocation, and hand-tuned shared memory layouts. The FA2 forward kernel is hundreds of lines of intricate CUDA. The ThunderKittens version is 11 lines of tile operations that a compiler can verify for correctness.
And the performance? On the H100, ThunderKittens' attention forward kernel matches FlashAttention-3 — the state-of-the-art hand-tuned implementation that uses Hopper-specific features like TMA and warp specialization. ThunderKittens uses those same features, but exposes them through clean abstractions instead of raw PTX.
This equivalence is not an accident. FA3 and ThunderKittens use the same algorithmic approach (tiled attention with online softmax), the same hardware features (TMA, warp specialization, register reallocation), and the same memory hierarchy strategy (Q in registers, K/V streaming through SRAM). The only difference is the programming model: FA3 is hundreds of lines of intricate CUDA with hand-managed PTX intrinsics; ThunderKittens is tens of lines of tile operations. Same performance, dramatically less code, dramatically lower bug surface.
A single warp can only use one tensor core. But an SM has multiple tensor cores, and a thread-block can contain multiple warp groups. To saturate the hardware, we need to coordinate multiple warps working on different parts of the problem while sharing data through shared memory.
ThunderKittens provides a block-level programming template called LCSF: Load, Compute, Store, Finish. You fill in four functions, and ThunderKittens handles all the synchronization, barrier management, and warp specialization.
The key idea is warp specialization: different warps in the same thread-block do different jobs. Some warps are "producers" (loading data), others are "consumers" (computing on data). This lets loads and computes overlap in time — while the compute warps are multiplying the current tile, the load warps are already fetching the next tile from HBM.
cuda // === LOAD function (producer warp) === if(warpgroup::warpid() == 0) { // Tell TMA how many bytes to expect tma::expect(inputs_arrived, block.k, block.v); // Fire off async loads from HBM -> shared memory tma::load_async(block.k, globals.k, {batch, head, iter, 0}, inputs_arrived); tma::load_async(block.v, globals.v, {batch, head, iter, 0}, inputs_arrived); } else arrive(inputs_arrived); // non-load warps just signal
cuda // === COMPUTE function (consumer warp groups) === warpgroup::mm_ABt(att, scratch.q[state.id], block.k); warpgroup::mma_async_wait(); // Online softmax sub_row(att, att, max_vec); exp(att, att); div_row(att, att, norm_vec); copy(att_bf16, att); // Accumulate O += softmax(QK^T) * V warpgroup::mma_AB(state.o, att_bf16, block.v); warpgroup::mma_async_wait(); arrive(inputs_finished);
Different warps need different resources. Load warps barely touch registers — they just issue TMA commands and wait on barriers. Compute warps need as many registers as possible to hold Q tiles, attention tiles, output accumulators, and softmax state. ThunderKittens lets you explicitly reallocate registers between warps:
cuda // Load warp: give up registers (only need 40) warpgroup::decrease_registers<40>(); // Compute warp: claim extra registers (up to 232) warpgroup::increase_registers<232>();
This is a Hopper-specific feature (the setmaxnreg PTX instruction). On A100, you cannot dynamically reallocate registers, so all warps must share the same budget. On H100, ThunderKittens exploits this to give compute warps nearly the full 255-register limit while keeping load warps lean.
The math works out cleanly. A thread-block with 1 load warpgroup (128 threads at 40 regs each = 5,120 regs) and 2 compute warpgroups (256 threads at 232 regs each = 59,392 regs) uses a total of 64,512 registers. The H100 has 65,536 registers per SM (256 KB / 4 bytes), so this configuration uses 98.4% of the register file. Almost nothing is wasted.
Without warp specialization, all 384 threads would share the register budget equally: 65,536 / 384 = 170 registers per thread. The compute warps would have 170 instead of 232 — a 27% reduction in register capacity, meaning smaller tiles, more spills, and worse performance. Warp specialization is not just a convenience; it is a fundamental enabler of high tile occupancy.
The LCSF template supports multi-stage pipelining. With N stages, you can have N tiles in flight simultaneously: stage 0 is being computed, stage 1 is being loaded, stage 2's load just finished, etc. The effect on performance is dramatic:
| Pipeline Stages | TFLOPS (GEMM, M=N=K=4096) | % of Peak |
|---|---|---|
| 1 (no pipelining) | 260 | 26% |
| 2 | 484 | 49% |
| 3 | 683 | 69% |
| 4 | 760 | 77% |
Here is a timeline view of what happens with 4-stage pipelining. Each row is a pipeline stage, and time flows left to right:
timeline Time --> t0 t1 t2 t3 t4 t5 -------- -------- -------- -------- -------- -------- Stage 0: LOAD COMPUTE STORE Stage 1: LOAD COMPUTE STORE Stage 2: LOAD COMPUTE STORE Stage 3: LOAD COMPUTE STORE
At time t1, stage 0 is computing while stage 1 is loading. At t2, stage 0 is storing, stage 1 is computing, and stage 2 is loading. The tensor cores are busy at every time step after the pipeline fills. Without pipelining (1 stage), the tensor cores sit idle during every load and store phase.
The final level of the ThunderKittens hierarchy is the grid — how thread-blocks are scheduled across the 132 SMs of the H100. Two optimizations at this level have outsized impact: persistent grids and block launch order.
In a normal CUDA launch, the GPU creates one thread-block per output tile. When a block finishes, the hardware scheduler launches a new one. This launch overhead is significant for small tiles — setting up a new block (allocating shared memory, initializing barriers, loading constants) can take thousands of cycles.
A persistent grid launches exactly as many blocks as there are SMs (132 on H100) and keeps them alive for the entire kernel. Each block processes multiple output tiles by iterating through a work queue. No launch overhead, no context switching.
| K | TK (no persist) | TK (persist) | CuBLAS |
|---|---|---|---|
| 64 | 93 TFLOPS | 108 TFLOPS | 69 TFLOPS |
| 128 | 161 | 184 | 133 |
| 256 | 271 | 309 | 242 |
| 512 | 414 | 450 | 407 |
For skinny matrices (small K), persistent grids give a 16% speedup. Even for larger K, the persistent version consistently outperforms. And both TK configurations beat CuBLAS at small K — a 40-line kernel outperforming NVIDIA's flagship library.
Why does CuBLAS struggle at small K? At K=64 with M=N=4096, the problem is tiny: 2 × 4096 × 64 × 4096 = 2.1 billion FLOPs. On a 989 TFLOPS GPU, that should take about 2 microseconds. But launching 4096/128 × 4096/128 = 1,024 thread-blocks takes overhead that is a significant fraction of 2 microseconds. CuBLAS launches fresh blocks; ThunderKittens' persistent grid already has 132 blocks running — they just pick up the next work item from a queue. Zero launch overhead.
This matters in practice because many ML workloads involve sequences of small GEMMs (e.g., MLP layers with small batch sizes during inference, or the K and V projections in attention with small sequence lengths). CuBLAS is tuned for large square matrices; ThunderKittens performs well across the entire size spectrum.
The order in which blocks process output tiles determines how well the L2 cache is utilized. Consider a GEMM C = A × B where C is tiled into a grid. If blocks iterate in row-major order {N, M}, each new block needs a completely different column of B. But if blocks iterate in a swizzled order {8, N, M/8}, neighboring blocks share columns of B, which stay hot in L2.
| Kernel | Block Order | HBM BW | TFLOPS |
|---|---|---|---|
| GEMM 16K³ | {8, N, M/8} (swizzled) | 982 GB/s | 805 |
| GEMM 16K³ | {N, M} (naive) | 3,070 GB/s | 392 |
| Attention | {N, H, B} (optimal) | 213 GB/s | 600 |
| Attention | {B, H, N} (naive) | 2,390 GB/s | 494 |
ThunderKittens is organized around three principles, one for each level:
Each principle addresses a different bottleneck: register utilization (warp), latency hiding (block), and cache reuse (grid). Together, they let a 40-line kernel achieve 80%+ of theoretical peak.
For attention with batch size B, H heads, and sequence length N, each output tile is indexed by (b, h, n). The block launch order determines which tiles are processed by adjacent blocks on the same SM cluster — and therefore which input tiles remain in L2.
With order {B, H, N}: block 0 processes (0, 0, 0), block 1 processes (0, 0, 1), etc. Consecutive blocks along N share the same Q, K, V tensors for the same (b, h) pair. The K and V tiles from the current sequence position stay hot in L2 for the next block that processes the next query position of the same head.
With order {N, H, B}: block 0 processes (0, 0, 0), block 1 processes (0, 1, 0) — a completely different head. The K, V tiles from head 0 are evicted before they can be reused. Every block reads fresh data from HBM.
The difference: 213 GB/s HBM traffic (good order) vs 2,390 GB/s (bad order). That is 11× more HBM bandwidth wasted on re-reading data that could have stayed in L2. And the TFLOPS difference (600 vs 494) directly reflects this: more time waiting for HBM = less time computing.
Now let's see how all three levels — warp, block, grid — combine in a complete attention kernel. This is the showpiece: a kernel that matches FlashAttention-3 on forward pass and outperforms all baselines by 10–40% on the backward pass.
The attention forward pass computes O = softmax(QKT / √d) · V for sequences of length N with head dimension d. ThunderKittens tiles the computation:
The load warpgroup fires off TMA loads for the next K and V tiles while the compute warpgroups are still working on the current tiles. By the time the compute finishes, the next tiles are already in shared memory. This is the pipelining from Chapter 5 in action.
The softmax in attention must be computed over the entire sequence length, but we only see one K tile at a time. ThunderKittens uses the online softmax trick (the same one FlashAttention uses): maintain a running maximum and running sum, and correct the output accumulator whenever a new maximum is found.
cuda // For each K tile in the sequence: mma_ABt(att, q_reg, k_reg, att); // QK^T for this tile // Update running max row_max(new_max, att); max(new_max, new_max, old_max); // Correction factor for previous accumulator sub_row(correction, old_max, new_max); exp(correction, correction); mul_row(o_reg, o_reg, correction); // rescale previous output // Softmax current tile sub_row(att, att, new_max); exp(att, att); row_sum(tile_sum, att); // Accumulate mma_AB(o_reg, att_bf16, v_col, o_reg); // Update running state copy(old_max, new_max); add(running_sum, running_sum, tile_sum);
The attention backward pass is harder to optimize than the forward pass because it requires recomputing the attention matrix and has more complex data dependencies. Here is where ThunderKittens' advantage is most visible:
| Metric | FlashAttention-3 | ThunderKittens | Advantage |
|---|---|---|---|
| Tensor core utilization | 61.2% | 58.2% | FA3 slightly higher |
| HBM bandwidth used | 328 GB/s | 490 GB/s | TK 1.5× more efficient |
| Shared memory stall cycles | 1.83 | 0.14 | TK 13× fewer stalls |
| Overall throughput | Baseline | +10–40% | TK faster end-to-end |
Before attention, let's look at the simpler case: general matrix multiply (GEMM). ThunderKittens' GEMM kernel is just 40 lines of code using the LCSF template. Here is how it performs against CuBLAS (NVIDIA's flagship library, 689 MB of hand-tuned code):
| Matrix Size (M=N=K) | TK (persistent) | CuBLAS | TK vs CuBLAS |
|---|---|---|---|
| 4096 | 767 TFLOPS | 735 TFLOPS | +4% |
| 8192 | ~790 TFLOPS | ~780 TFLOPS | ~equal |
| 16384 | 805 TFLOPS | ~800 TFLOPS | ~equal |
A 40-line kernel matching a 689 MB library. At 16K×16K×16K, ThunderKittens achieves 805 TFLOPS — 81% of the H100's 989 TFLOPS theoretical peak. The remaining 19% is spent on shared memory loads, register-to-register copies, and softmax-like reductions. This is near the hardware limit.
For skinny matrices (small K), ThunderKittens actually beats CuBLAS by significant margins (up to 57% at K=64) because the persistent grid avoids block launch overhead that dominates at small problem sizes.
For autoregressive models, attention is causal — token i can only attend to tokens j ≤ i. ThunderKittens handles this through the tile scheduling: tiles where all keys come after the query position are simply skipped (zero attention). Tiles on the diagonal (partially causal) apply a mask within the tile. The tiling structure means most of the computation is either fully causal (skip) or fully non-causal (no mask needed), minimizing wasted work.
To be concrete: for a 4096-token sequence with 64×64 tiles, the attention matrix is 64×64 = 4,096 tiles. Of these, about half are below the diagonal (fully non-causal: compute normally), about half are above (fully causal: skip entirely), and 64 are on the diagonal (partially masked: apply the causal mask within the tile). Only 64 out of 4,096 tiles need masking logic — less than 2% of the total work. The overhead of causal masking is negligible when the tile size is chosen well.
ThunderKittens' tile abstraction makes this optimization automatic. The grid scheduler skips tiles above the diagonal, and the warp-level code applies the mask only for diagonal tiles. In raw CUDA, implementing this skip-logic efficiently requires careful bookkeeping; in ThunderKittens, it is a coordinate comparison in the grid loop.
Shows how data flows through the three ThunderKittens levels during one iteration of the attention forward pass. Producer warps (teal) load the next K,V tiles while consumer warps (warm) compute on the current tiles.
Matching FlashAttention-3 on standard attention is impressive, but the real value of ThunderKittens is on emerging architectures — models that don't have hand-tuned CUDA kernels. When a new architecture is published, the first implementation is usually in PyTorch (slow), then maybe Triton (better), and eventually someone spends months writing a CUDA kernel (fast). ThunderKittens compresses that last step from months to days.
Linear attention replaces the softmax(QKT) with a kernel function φ(Q)φ(K)T that can be computed in O(N) instead of O(N²). The trick is that the multiplication order changes: instead of computing the N×N attention matrix, you compute φ(K)TV first (a d×d matrix), then multiply by φ(Q).
The catch is that the d×d intermediate state must be maintained across the sequence as a running sum — this is the "state" in "state-space model." Maintaining this state efficiently requires careful register management, which is exactly what ThunderKittens provides.
| Variant | Baseline | ThunderKittens | Speedup |
|---|---|---|---|
| Polynomial feature map | Flash Linear Attention | TK kernel | 14× |
| Learned feature map | Flash Linear Attention | TK kernel | 6.5× |
14× is not a typo. The Flash Linear Attention baseline is already an optimized kernel, not naive PyTorch. The speedup comes from ThunderKittens' ability to keep the d×d state in registers and pipeline the sequential updates through the LCSF template.
Why is the state-maintenance so hard to optimize? In linear attention, you maintain a running state matrix S (d×d) that is updated at every time step: St = St-1 + φ(kt) · vtT. Then the output is ot = φ(qt) · St. For d=64, S is 64×64 = 4,096 elements in FP32 = 16 KB. That is too large for one warp's registers (which hold at most ~8 KB of tile data), so you need warpgroup-level collective ownership.
In Triton, you cannot express this: the compiler decides where S lives, and it typically puts it in shared memory, adding 30-cycle latency to every state access. In ThunderKittens, you explicitly declare S as a register tile distributed across a warpgroup, and every update and query is a register-to-register operation at 1-cycle latency. This 30× latency reduction on the most frequently accessed data structure explains most of the 14× speedup.
Mamba-2 is a state-space model that maintains a hidden state and updates it at each time step. The update involves matrix multiplies on the state, which maps naturally to ThunderKittens tiles:
| Kernel | Baseline | ThunderKittens | Speedup |
|---|---|---|---|
| Mamba-2 forward | Triton kernel | TK kernel | >3× |
The Triton baseline is already GPU-optimized. ThunderKittens' advantage comes from register-level control that Triton's compiler cannot achieve — Triton operates at the block level and cannot control which data lives in registers vs shared memory.
Mamba-2's selective scan involves a recurrence: ht = At · ht-1 + Bt · xt, where At is a diagonal state transition matrix and ht is the hidden state. The diagonal structure means At · ht-1 is an element-wise multiply (a pointwise operation in ThunderKittens), while Bt · xt is an outer product (a single MMA). The sequential nature of the recurrence means the state must persist in registers across time steps — any spill to SRAM or HBM adds latency to every step in the chain.
ThunderKittens keeps the entire state in register tiles, uses the LCSF template to pipeline the input loads, and processes chunks of time steps in parallel where the recurrence allows. The result: >3× faster than the best Triton implementation, with the same numerical results.
Some architectures (Hyena, S4) use long convolutions computed via FFT. The FFT involves complex multiplications on tiles of data that fit naturally into ThunderKittens' 16×16 abstraction:
| Sequence Length | FlashFFTConv | ThunderKittens | Speedup vs FlashFFTConv | Speedup vs PyTorch FFT |
|---|---|---|---|---|
| 1024 | Baseline | TK kernel | 7.9× | — |
| 4096 | Baseline | TK kernel | 4.7× | 8.7× |
The profiling data tells the story of why ThunderKittens is faster:
| Metric | FlashFFTConv | ThunderKittens |
|---|---|---|
| Tensor core utilization | 13.4% | 54.8% |
| HBM bandwidth | 14.8 GB/s | 31.4 GB/s |
The FFT kernel is particularly interesting because FFT is not traditionally thought of as a tensor-core workload. The key insight is that the FFT butterfly operations can be expressed as matrix multiplies on small tiles — the Monarch matrix decomposition factors the FFT into a sequence of block-diagonal matrix multiplies, each of which maps perfectly to ThunderKittens' 16×16 tiles. The TK FFT kernel uses 64×64 tiles with 4 pipeline stages (INPUT=4, OUTPUT=2), with multiple consumer warpgroups processing different frequency blocks in parallel.
This is a general lesson: many algorithms that seem inherently sequential or non-matmul can be reformulated as tile-based linear algebra. Once reformulated, ThunderKittens gives them near-peak performance almost for free. The library's value is not just in the kernels it ships, but in the thinking framework it promotes: decompose your algorithm into 16×16 tile operations, and the hardware will do the rest.
The speedups on emerging architectures (3–14×) are much larger than on standard attention (~1×). This is not because the architectures are easier to optimize — it is because the baselines are weaker. Standard attention has FlashAttention-2 and FA3: kernels that took years of expert CUDA engineering. Emerging architectures have Triton kernels or PyTorch implementations that no one has spent months hand-tuning.
The Triton compiler operates at the block level — it sees tiles of data but cannot control which tiles live in registers vs shared memory. For standard attention, Triton's block-level view happens to produce good code because the access pattern is regular. For state-space models, where the computation involves sequential state updates with complex data dependencies, Triton's compiler makes suboptimal choices about register allocation and memory placement.
ThunderKittens closes this gap by giving the programmer explicit register control with a clean API. The "months of CUDA expertise" is baked into the library's abstractions (swizzled layouts, LCSF pipelining, persistent grids), so the kernel author only needs to express the algorithm's tile-level logic.
This has profound implications for ML research velocity. Today, proposing a new attention variant (linear attention, sliding window, multi-query, grouped-query) means either accepting PyTorch-speed benchmarks that understate the method's potential, or waiting months for someone to write a CUDA kernel. With ThunderKittens, the researcher can write a competitive kernel in days and benchmark against the state of the art on equal footing. The library removes the "kernel engineering" tax from architecture research.
Consider the timeline for Based (linear attention with polynomial feature maps). The paper was published, a PyTorch implementation was available, a Triton kernel followed, but the ThunderKittens kernel showed that all previous benchmarks had underestimated the method's throughput by 14×. The algorithm was always fast — the implementations just couldn't show it. How many good ideas have been rejected because their benchmark implementations were slow? ThunderKittens makes this a question we can stop asking.
Across all these emerging architectures, the story is the same: existing implementations leave 50–90% of the hardware idle because they cannot control register placement. ThunderKittens fills this gap. The library's abstractions — tiles, LCSF, persistent grids — are general enough that any tile-based computation can be expressed in tens of lines and achieve 50–80% of theoretical peak.
And the development cost? The paper reports that these kernels were written by undergraduates with no prior CUDA experience in weeks, not months. That is the true contribution of ThunderKittens: democratizing high-performance GPU programming.
ThunderKittens is a systems paper, but its implications extend far beyond GPU programming:
| Domain | Impact |
|---|---|
| ML research | New architectures no longer need to wait months for efficient kernels. Linear attention, SSMs, and hybrid models can be prototyped and benchmarked at near-peak performance in days. |
| Compiler research | ThunderKittens shows that the right abstraction level matters more than compiler sophistication. Triton's block-level compiler cannot match hand-tuned register placement; TK's tile-level abstraction can. |
| Hardware design | The paper provides a template for what future GPU programming models should look like: expose hardware primitives (tensor cores, TMA, register reallocation) through clean abstractions, not through raw PTX. |
| Education | Undergraduates writing competitive CUDA kernels. This democratizes a skill that previously required years of systems experience. |
| Fact | Number |
|---|---|
| H100 BF16 tensor core peak | 989 TFLOPS |
| H100 HBM bandwidth | 3 TB/s |
| Tensor core FLOPs vs general FP32 | 16× |
| Register limit per thread (H100) | 255 |
| GEMM TFLOPS (TK, persistent, 16K³) | 805 (81% peak) |
| Attention backward speedup vs FA3 | 10–40% |
| Linear attention speedup (polynomial) | 14× |
| FFT convolution speedup | 4.7–7.9× |
| Mamba-2 speedup vs Triton | >3× |
| ThunderKittens library size | <1 MB |
ThunderKittens is not a silver bullet. Its current limitations are worth understanding:
| Limitation | Why |
|---|---|
| NVIDIA-only | Tightly coupled to CUDA, tensor cores, and TMA. No AMD ROCm or Intel support. |
| Hopper-optimized | Best features (TMA, setmaxnreg, warpgroup MMA) require H100. A100 support exists but with fewer optimizations. |
| BF16/FP32 focus | FP8 and integer quantization support is limited. As quantized inference becomes dominant, this matters. |
| Tile-shaped problems only | Irregular computations (sparse attention, graph neural networks) may not decompose cleanly into 16×16 tiles. |
| No autotuning | Pipeline stages, warpgroup counts, and block order are manual choices. An autotuner could explore this space automatically. |
Despite these limitations, the design philosophy — expose hardware primitives through clean, composable abstractions at the right granularity — is broadly applicable. Future GPU architectures (NVIDIA Blackwell, AMD CDNA4) will have their own tile-level primitives, and the ThunderKittens approach of wrapping them in a readable DSL is likely to persist even as the specific hardware changes.
ThunderKittens sits at a very specific point on the abstraction-performance tradeoff curve:
| Tool | Abstraction Level | Controls Registers? | Controls SRAM Layout? | Controls Grid? | Typical Lines |
|---|---|---|---|---|---|
| PyTorch | Tensor | No | No | No | 5–20 |
| Triton | Block | No | Partial | No | 30–100 |
| ThunderKittens | Tile | Yes | Yes | Yes | 30–80 |
| CUTLASS | Template | Yes | Yes | Yes | 200–1000 |
| Raw CUDA/PTX | Thread | Yes | Yes | Yes | 500–5000 |
ThunderKittens gives you the same control as CUTLASS and raw CUDA — registers, shared memory layouts, grid scheduling — but in 30–80 lines instead of hundreds or thousands. The insight is that most of the complexity in CUTLASS and CUDA comes from managing thread-level details (which thread owns which element, how to synchronize warps, how to avoid bank conflicts). By lifting the abstraction to 16×16 tiles, all of that complexity is handled by the library.