ThunderKittens — Veanors

Chapter 0: The Memory Wall

You have a GPU with 989 TFLOPS of BF16 tensor-core throughput. That is nearly a petaflop of raw compute. And yet your attention kernel runs at 600 TFLOPS — barely 60% of what the hardware can do. Where does the other 40% go?

It goes to waiting. Waiting for data to arrive from memory. Waiting for bank conflicts to resolve. Waiting for loads to complete before the next multiply can begin. The arithmetic units are fast enough; the problem is feeding them.

This problem has a name: the memory wall. Modern GPUs are so compute-dense that the bottleneck is almost never the math — it is moving data from where it lives to where it needs to be. The NVIDIA H100 can perform 989 trillion BF16 operations per second, but its main memory (HBM) can only deliver 3 TB/s. A single BF16 matrix multiply of two 16×16 tiles requires 1024 bytes of input and produces 512 bytes of output. At 3 TB/s, that data transfer takes about 0.5 nanoseconds. But the tensor cores can finish the multiply in about 0.04 nanoseconds. The compute is 12× faster than the memory.

The fundamental bottleneck: On an NVIDIA H100, BF16 tensor cores provide 16× the FLOPs of general-purpose BF16/FP32 compute. But HBM bandwidth is only 3 TB/s. The arithmetic intensity required to keep tensor cores busy is enormous — you must reuse every byte of data many times before it leaves the register file. This is the problem ThunderKittens solves.

Every generation of GPU widens this gap. The A100 had 2 TB/s HBM bandwidth and 312 TFLOPS of BF16 tensor-core compute. The H100 has 3 TB/s bandwidth (1.5× more) but 989 TFLOPS (3.2× more compute). Bandwidth grows linearly; compute grows quadratically with transistor count. The wall gets taller every generation.

The solution is not to buy more bandwidth. The solution is to organize your computation so that data stays close to the compute units — in registers and shared memory — and moves through the memory hierarchy as few times as possible. This is what FlashAttention did for attention. ThunderKittens generalizes the approach to any kernel by making the register file a first-class programming abstraction.

Why existing tools fall short

Writing efficient GPU kernels today means choosing between three painful options:

Tool	Abstraction Level	Performance	Pain
Raw CUDA	Thread-level	Optimal (if you're an expert)	Thousands of lines, months of tuning
CUTLASS / CuTe	Template meta-programming	Near-optimal	22 MB of nested templates, steep learning curve
Triton	Block-level compiler	Good for simple kernels	Compiler opaque — can't control register placement

ThunderKittens proposes a fourth option: program at the tile level. Instead of thinking about individual threads, warps, or compiler-generated code, you write operations on 16×16 matrix tiles that map directly to tensor-core instructions. The tiles live in registers. The operations are PyTorch-like. And the resulting code is 10–40 lines instead of hundreds.

Library	Size
CuBLAS	689 MB
CUTLASS	22 MB
Triton	12.6 MB
ThunderKittens	<1.0 MB

Less than one megabyte. That is the entire ThunderKittens library — a header-only C++ template library that gives you register-level control with a clean, readable API. The authors report that undergraduates with no prior CUDA experience wrote competitive kernels in weeks.

The name "ThunderKittens" reflects the philosophy: these are small kernels (kittens, not lions) that are fast (thunder). The library trades generality for simplicity — it targets tile-based linear algebra workloads on NVIDIA Hopper GPUs, not arbitrary GPU programs. By narrowing the scope, it can make the common case trivial while still covering the vast majority of ML kernel needs.

The rest of this lesson will take you through the ThunderKittens design from the bottom up: the GPU memory hierarchy it exploits, the tile abstractions it provides, the three levels of parallelism it manages, and the concrete kernels that demonstrate its power.

Why does the H100's 989 TFLOPS of BF16 compute not translate to 989 TFLOPS of actual kernel throughput?

The tensor cores have manufacturing defects that reduce effective throughput HBM bandwidth (3 TB/s) cannot feed data fast enough — compute units stall waiting for memory, so the bottleneck is data movement, not arithmetic The operating system scheduler prevents full GPU utilization

Chapter 1: The GPU Memory Hierarchy

To understand ThunderKittens, you need to understand the four levels of memory on a modern GPU. Each level trades capacity for bandwidth — the closer to the compute units, the faster but smaller the memory.

Here are the exact numbers for the NVIDIA H100 SXM:

Level	Capacity	Bandwidth	Latency	Scope
HBM (High Bandwidth Memory)	80 GB	3 TB/s	~400 cycles	Global — all SMs
L2 Cache	50 MB	12 TB/s	~200 cycles	Global — all SMs
SRAM (Shared Memory)	228 KB / SM	33 TB/s	~30 cycles	Per thread-block
Registers	256 KB / SM	>100 TB/s	1 cycle	Per thread

Look at that bandwidth column. From HBM to registers, bandwidth increases by more than 30×. A byte sitting in a register is accessed in a single cycle with effectively infinite bandwidth. A byte in HBM takes 400 cycles to arrive and must share a 3 TB/s bus with every other SM on the chip.

GPU Memory Hierarchy

Data flows from HBM through the cache hierarchy to registers. The bars show relative bandwidth at each level. Hover over each level to see details. Click "Animate" to watch a tile's journey from HBM to a tensor-core multiply.

Click to animate data flow

The register bottleneck

Registers are the fastest memory on the chip, but they come with a hard constraint: each thread can use at most 255 registers on the H100. If your kernel needs more, the compiler spills registers to L1 cache — which is an order of magnitude slower. This is a cliff, not a slope. Exceeding 255 registers per thread means your carefully-tuned inner loop suddenly has cache-miss latency in every iteration.

ThunderKittens manages this constraint explicitly. It provides a warpgroup::increase_registers<232>() call for compute-heavy warps and warpgroup::decrease_registers<40>() for load-focused warps. By specializing warps, each one stays within its register budget while collectively the thread-block uses the full register file.

Shared memory: the staging area

Shared memory (SRAM) sits between registers and HBM. It is programmer-managed — you explicitly load data into it and explicitly read from it. On the H100, each streaming multiprocessor has 228 KB of shared memory, organized into 32 physical banks.

Bank conflicts are the hidden killer. When two threads in the same warp access different addresses in the same bank, the accesses are serialized. A 32-way bank conflict means your shared memory load takes 32× longer than expected. Naive row-major layouts for 16×64 BF16 tiles cause 8-way bank conflicts. ThunderKittens uses swizzled layouts that XOR address bits to distribute accesses across banks, achieving zero bank conflicts.

Why swizzling matters: A 16×64 BF16 tile has 64 columns of 2 bytes each = 128 bytes per row. With 32 banks of 4 bytes each, that is exactly 128/4 = 32 bank slots per row. A naive layout means every row starts at the same bank. Two warps reading different rows of the same tile collide on every bank. Swizzling XORs the row index into the column address, rotating each row's bank assignment. The result: zero conflicts, full bandwidth.

The H100's 132 SMs

The H100 has 132 streaming multiprocessors (SMs). Each SM has 4 physical execution units that can run simultaneously: a load/store unit, an ALU, an FMA unit, and a tensor core. This means four different types of instructions can execute in parallel within a single SM — but only if your code is structured to exploit this parallelism.

Each SM can host up to 64 software warps (2048 threads). But more warps means each warp gets fewer registers. This is the occupancy tradeoff: high occupancy hides latency (more warps to switch between while waiting for memory), but low occupancy gives each warp more registers to work with. ThunderKittens lets you control this tradeoff explicitly through its block-level template.

The conventional CUDA wisdom is "maximize occupancy." But ThunderKittens shows this is wrong for compute-bound kernels. Their experiments reveal an inverted-U relationship: performance increases with occupancy up to a point, then decreases as register pressure causes spills. The LCSF template expands the Pareto frontier — at any given occupancy level, a pipelined LCSF kernel outperforms a non-pipelined one because it hides latency through software pipelining rather than hardware warp switching.

The sweet spot for attention on H100 is 2–3 compute warpgroups per SM with 4 pipeline stages. This gives enough warps to keep the SM busy during barrier waits, while giving each compute warp enough registers (232) to hold all tile data without spills. Going to 4 warpgroups would reduce registers to ~170 per thread and force tile sizes down, losing more performance to smaller tiles than is gained from higher occupancy.

A worked example: why arithmetic intensity matters

Let's make the memory wall concrete with a simple matrix multiply C = A × B, where A is M×K and B is K×N. The total arithmetic is 2·M·K·N FLOPs (one multiply and one add per output element, for each of K inner-product terms). The total data is (M·K + K·N + M·N) elements to read/write.

Arithmetic Intensity = FLOPs / Bytes = 2MKN / (2(MK + KN + MN))

For a square matrix M = N = K = 4096 in BF16 (2 bytes each):

Quantity	Value
FLOPs	2 × 4096³ = 137 billion
Bytes (BF16)	2 × 3 × 4096² = 100 MB
Arithmetic intensity	137B / 100M = 1,365 FLOPs/byte
H100 compute/BW ratio	989 TFLOPS / 3 TB/s = 330 FLOPs/byte

The required intensity (1,365) exceeds the hardware's compute-to-bandwidth ratio (330), so this problem is compute-bound — the math is the bottleneck, not the memory. Good. But now consider the attention inner loop for sequence length 4096 and head dimension 64:

Quantity	Value
QK^T FLOPs per tile	2 × 16 × 64 × 16 = 32K
Bytes to load K tile (16×64 BF16)	2,048
Arithmetic intensity (QK^T only)	32K / 2K = 16 FLOPs/byte

16 FLOPs per byte. That is 20× below the H100's compute-to-bandwidth ratio. If you load K from HBM for every tile multiply, the tensor cores sit idle 95% of the time. The only way to make attention compute-bound is to load K into shared memory once and reuse it across many query rows — and to keep the attention scores and output accumulators in registers so they never touch HBM at all. This is exactly what ThunderKittens (and FlashAttention) do.

The rule of thumb: If your arithmetic intensity is below ~330 FLOPs/byte on H100, your kernel is memory-bound and the tensor cores are starved. The solution is always the same: move data closer to compute (registers > SRAM > L2 > HBM) and reuse it as many times as possible before evicting.

Why does ThunderKittens use swizzled shared-memory layouts instead of simple row-major?

Row-major is slower because it requires more memory allocation Row-major causes cache misses in L2 Row-major causes bank conflicts — multiple warps reading different rows hit the same shared-memory banks, serializing access. Swizzling XORs row indices into column addresses to distribute accesses across all 32 banks.

Chapter 2: Tiles as First-Class Primitives

Here is the core design insight of ThunderKittens: the 16×16 matrix tile is the right abstraction for GPU programming.

Why 16×16? Because that is the native size of the H100's tensor-core instruction. When you call mma_AB(C, A, B, C) in ThunderKittens, it compiles down to a single HMMA instruction that multiplies a 16×16 tile of A by a 16×16 tile of B and accumulates into C. No loops, no thread-level indexing, no register shuffling. One tile operation = one hardware instruction.

ThunderKittens defines three types of tiles, one for each memory level:

Lives in the warp's register file. Fastest access. Each warp collectively owns tiles in registers. Example: rt_bf<16, 64> = 16 rows, 64 cols of BF16.

↓ load / store

Shared Tile (st)

Lives in shared memory (SRAM). Visible to all warps in a thread-block. Example: st_bf<64, 64> = 64×64 BF16 tile with swizzled layout.

↓ TMA load / store

Global Layout (gl)

Descriptor for data in HBM. 4D tensor with compile/runtime dims. Example: gl<bf16, -1, -1, -1, 64> = dynamic batch/head/seq, static d=64.

The type system

ThunderKittens tiles are C++ templates parameterized by type, height, and width. The naming convention encodes the data type directly:

Declaration	Memory	Type	Shape
`rt_bf<16, 64>`	Registers	BFloat16	16 × 64
`rt_fl<16, 16>`	Registers	Float32	16 × 16
`st_bf<64, 64>`	Shared	BFloat16	64 × 64
`st_fl<64, 64>`	Shared	Float32	64 × 64

Register tiles also carry a layout tag — either row-major or column-major. This is critical because tensor-core instructions require specific layouts: mma_AB needs A in row-major and B in column-major. ThunderKittens tracks layouts at compile time and raises errors if you try to multiply incompatible tiles. No silent correctness bugs from layout mismatches.

cuda
using namespace kittens;

// Register tiles — one per warp
rt_bf<16, 64> q_reg;      // query: 16 rows, 64 cols, BF16, row-major
rt_bf<16, 64> k_reg;      // key tile
rt_fl<16, 16> att;        // attention scores: float32 for precision
rt_bf<16, 64> o_reg;      // output accumulator

// Shared tiles — visible to all warps in the block
st_bf<64, 64> k_smem[2]; // double-buffered key tiles
st_bf<64, 64> v_smem[2]; // double-buffered value tiles

// Global layout descriptor — 4D: [batch, head, seq, dim]
gl<bf16, -1, -1, -1, 64> globals_k; // last dim static

Why tiles and not tensors? A "tensor" abstraction (like Triton's) is too high-level — the compiler must guess how to decompose it into register-sized chunks, and it often guesses wrong. A "thread" abstraction (like raw CUDA) is too low-level — you must manually coordinate 32 threads per warp to collectively hold a tile. The 16×16 tile is the Goldilocks zone: it matches the hardware primitive exactly, so the mapping from abstraction to instruction is one-to-one.

Larger tiles from composition

What "owns" a register tile?

A critical detail: register tiles are owned by warps, not threads. A warp is 32 threads executing in lockstep. When you declare rt_bf<16, 64>, the 16×64 = 1,024 BF16 elements are distributed across the 32 threads of the warp — each thread holds 32 elements (64 bytes, using 16 of its 255 available 32-bit registers). The distribution pattern is dictated by the tensor-core instruction format, and ThunderKittens manages it automatically.

This is why layout matters at compile time. A row-major rt_bf<16, 64> distributes elements so that the tensor-core mma_AB instruction can read matrix A directly from the warp's registers. A column-major layout distributes elements for matrix B. Passing a row-major tile as B would mean the hardware reads the wrong elements — a silent correctness bug. ThunderKittens' compile-time layout tags prevent this entirely.

Register accounting: A rt_bf<16, 64> tile costs 16 registers per thread (1024 BF16 elements / 32 threads = 32 BF16 elements per thread = 16 32-bit registers). An attention warp holding q_reg (16 regs), att (8 regs for 16×16 FP32), o_reg (16 regs), plus softmax state (max_vec, sum_vec, ~4 regs) uses about 44 registers just for tile data. Add loop counters, pointers, and temporaries, and you need ~60–80 registers per thread. With the H100's 255-register limit, you can fit the entire attention inner loop in registers — but barely. This is why register budget management (Chapter 5) matters.

Real kernels need tiles bigger than 16×16. ThunderKittens handles this through composition: a rt_bf<16, 64> is internally four 16×16 tiles laid out contiguously in registers. A st_bf<64, 64> is sixteen 16×16 tiles in shared memory. Operations on larger tiles decompose into operations on the constituent 16×16 subtiles — the compiler unrolls this at compile time, producing exactly the right sequence of tensor-core instructions.

At the warpgroup level (4 warps working collectively), tiles can be even larger. A warpgroup can collectively own a 64×64 register tile, with each of its 4 warps holding a 16×64 slice. The warpgroup::mma_AB() operation coordinates all 4 warps to perform a 64×64 matrix multiply using their collective registers.

The composition is purely mechanical: a 64×64 MMA decomposes into 4×4 = 16 underlying 16×16 HMMAs, one per warp per subtile. The compiler unrolls this at compile time — at runtime, you get a flat sequence of 16 tensor-core instructions with no loop overhead. This means the cost of using larger tiles is exactly proportional to their area, with no additional overhead for coordination or indexing.

The global layout descriptor (gl) is the bridge between ThunderKittens' tile world and PyTorch's tensor world. You create a gl from a raw pointer and shape information, and TMA uses it to fetch 2D tiles from a 4D tensor. The last dimension is typically static (head dimension d=64 or d=128), while batch, head, and sequence dimensions are dynamic. This lets a single compiled kernel handle variable batch sizes and sequence lengths without recompilation.

Why does ThunderKittens use 16×16 as its fundamental tile size?

Because 16×16 is the native size of the H100 tensor-core HMMA instruction — one tile operation maps to exactly one hardware instruction, with no decomposition needed Because 16×16 tiles fit exactly in the L2 cache Because 16 is the number of threads in a half-warp

Chapter 3: Tile Operations

ThunderKittens provides four categories of operations on tiles. The API is deliberately PyTorch-like — if you know PyTorch, you already know the verbs.

1. Memory operations

Move data between memory levels:

cuda
// Register <-> Shared memory
load(k_reg, k_smem[subtile]);       // shared -> register
store(k_smem[subtile], k_reg);       // register -> shared

// Shared memory <-> HBM (via TMA — Tensor Memory Accelerator)
tma::load_async(k_smem, globals.k,   // HBM -> shared (async)
    {batch, head, iter, 0}, barrier);
tma::store_async(globals.o, o_smem); // shared -> HBM (async)

// Copy between register tiles
copy(att_bf16, att_fl32);             // type conversion + copy

The tma::load_async calls use the H100's Tensor Memory Accelerator — a dedicated hardware unit that handles address generation and data movement without consuming any compute resources. While TMA loads data from HBM to shared memory, the tensor cores can continue multiplying tiles that were loaded in the previous iteration.

TMA is a Hopper-generation feature. On A100, async loads required the programmer to manage the address generation manually. On H100, TMA handles the entire load pipeline: computing the address from a 4D index (batch, head, sequence, dimension), performing bounds checking, and transferring data directly to shared memory — all without any thread involvement. This frees up warps to do useful compute instead of acting as data movers.

The tma::expect() call tells the barrier how many bytes to expect, so the waiting warps know when the load is complete. This producer-consumer pattern is the foundation of the LCSF template (Chapter 5).

2. Matrix multiply (MMA)

The crown jewel. These map directly to tensor-core instructions:

cuda
// C = A * B + C  (accumulate into C)
mma_AB(C, A, B, C);     // A row-major, B column-major
mma_ABt(C, A, B, C);    // A row-major, B row-major (B transposed internally)

// Warpgroup-level: 4 warps collectively multiply larger tiles
warpgroup::mm_ABt(att, q_smem, k_smem);  // 64x64 multiply
warpgroup::mma_AB(o, att_bf16, v_smem);  // 64x64 accumulate
warpgroup::mma_async_wait();              // wait for completion

Layout enforcement: mma_AB requires A in row-major and B in column-major layout. If you pass tiles with wrong layouts, ThunderKittens raises a compile-time error. This catches bugs that in raw CUDA would silently produce wrong results and take days to debug.

The swap_layout_inplace() function converts a tile between row-major and column-major without moving data — it just reinterprets how registers map to matrix elements. This is free at runtime but changes the compile-time layout tag:

cuda
load(v_reg, v_smem[subtile]);           // loads as row-major
auto &v_col = swap_layout_inplace(v_reg); // now column-major (free!)
mma_AB(o_reg, att_mma, v_col, o_reg);   // compiles: B is column-major

3. Pointwise operations

Element-wise operations on register tiles, using the general-purpose ALU (not tensor cores):

cuda
exp(att, att);                // element-wise exponential
multiply(scaled, att, 0.125); // scalar multiply
zero(att);                    // fill with zeros
max(result, a, b);            // element-wise max

4. Row / column reductions

These are essential for softmax — you need row-wise max and row-wise sum:

cuda
row_max(max_vec, att);        // max of each row -> vector
sub_row(att, att, max_vec);   // subtract row max (for numerical stability)
exp(att, att);                // exponentiate
row_sum(sum_vec, att);        // sum of each row -> vector
div_row(att, att, sum_vec);   // normalize each row

These five lines are the entire softmax computation. In raw CUDA, softmax requires careful warp-shuffle reductions, shared-memory atomics, and multi-pass algorithms. In ThunderKittens, it is five function calls that read like pseudocode.

Under the hood, row_max and row_sum use warp-shuffle instructions to reduce across the 32 threads that collectively hold a row of the tile. Each thread holds a subset of elements; the shuffle reduction collects the partial results without going through shared memory. ThunderKittens generates these shuffles automatically from the tile layout — the user never sees them.

The sub_row and div_row operations broadcast a row vector (one value per row) across all columns of the tile. Again, the broadcast pattern depends on which thread holds which element, and ThunderKittens handles this automatically based on the tile's compile-time layout.

Tile Operations Map

The four categories of tile operations and which hardware units execute them. Click a category to highlight its data path through the memory hierarchy.

Parallelism within an SM: The H100 has 4 physical execution units per SM: load/store, ALU, FMA, and tensor core. When your kernel interleaves memory loads, pointwise ops, and MMA instructions, all 4 units can execute simultaneously. ThunderKittens' separation of operation types makes this pipelining natural — loads happen on the load/store unit while MMAs happen on the tensor core.

What does swap_layout_inplace() actually do at runtime?

Transposes the tile by moving data between registers Copies the tile to shared memory in the new layout Nothing at runtime — it reinterprets how existing registers map to matrix elements, changing only the compile-time layout tag. It is free.

Chapter 4: Warp-Level Kernels

The simplest ThunderKittens kernel operates at the warp level: one warp, operating on its own register tiles, calling tile operations. This is where you start. Let's build an attention inner loop from scratch.

Recall the attention formula: Attention(Q, K, V) = softmax(QK^T / √d) · V. For a single query row q and a block of keys K and values V, the computation is:

att = q · K^T
att = att − max(att) (numerical stability)
att = exp(att)
att = att / sum(att)
o = att · V

In ThunderKittens, this entire inner loop — the hottest code in all of deep learning — is 11 lines:

cuda
using namespace kittens;
rt_bf<16, 64> k_reg, v_reg;

// Load K tile from shared memory
load(k_reg, k_smem[subtile]);

// QK^T: compute attention scores
zero(att);
mma_ABt(att, q_reg, k_reg, att);

// Softmax
sub_row(att, att, max_vec);
exp(att, att);
div_row(att, att, norm_vec);

// Convert to BF16 for the V multiply
copy(att_mma, att);

// Attend to values: O += att * V
load(v_reg, v_smem[subtile]);
auto &v_reg_col = swap_layout_inplace(v_reg);
mma_AB(o_reg, att_mma, v_reg_col, o_reg);

Let's trace the data flow step by step:

load(k_reg, k_smem)

K tile moves from shared memory to register file. 16×64 BF16 = 2048 bytes. Takes ~30 cycles via shared memory bus.

↓

mma_ABt(att, q, k, att)

Tensor core multiplies q (16×64) by k^T (64×16) = att (16×16). One HMMA instruction. ~0.04 ns.

↓

sub_row + exp + div_row

Softmax on the ALU. Three passes over the 16×16 attention tile in registers. No memory traffic.

↓

copy(att_mma, att)

Convert FP32 attention weights to BF16 for the next MMA. In-register conversion.

↓

mma_AB(o, att, v, o)

Tensor core multiplies att (16×16, BF16) by V (16×64, col-major) and accumulates into O (16×64). One HMMA instruction.

Notice what doesn't happen: no data goes to HBM. No data even goes to shared memory during the core computation. Everything stays in registers. The only shared memory accesses are the initial loads of K and V, which can be pipelined with the previous iteration's compute.

This is why ThunderKittens is fast. The abstraction is not just syntactic sugar. By making register tiles the default data structure, it forces the programmer to think about data locality. You cannot write a ThunderKittens kernel that accidentally round-trips through HBM in a hot loop — the type system makes that operation explicit and visible.

Counting the hardware instructions

Let's count exactly what the hardware does for this inner loop. For one iteration (one K tile, one V tile):

Operation	Hardware Unit	Instructions
`load(k_reg, k_smem)`	Load/Store unit	~4 LDS instructions (loading 16×64 BF16 from SRAM)
`mma_ABt(att, q, k, att)`	Tensor Core	4 HMMA instructions (16×64 × 64×16 = four 16×16 multiplies)
`sub_row + exp + div_row`	ALU / FMA	~12 instructions (operating on 16×16 = 256 elements / 32 threads = 8 per thread)
`copy(att_mma, att)`	Register rename	~4 CVT instructions (FP32 to BF16)
`load(v_reg, v_smem)`	Load/Store unit	~4 LDS instructions
`mma_AB(o, att, v, o)`	Tensor Core	4 HMMA instructions

The key observation: the load/store, ALU, and tensor-core instructions use different hardware units. They can execute in parallel. While the tensor core is doing the mma_ABt, the load/store unit can be prefetching the next V tile. While the ALU computes softmax, the tensor core might be finishing a multiply from the previous cycle. This instruction-level parallelism is what makes ThunderKittens' tile abstraction map so efficiently to hardware.

Comparison with FlashAttention-2

FlashAttention-2 implements the same tiling strategy — keeping Q in registers, streaming K and V through shared memory. But FA2 is written in raw CUDA with manual warp-level matrix operations, explicit register allocation, and hand-tuned shared memory layouts. The FA2 forward kernel is hundreds of lines of intricate CUDA. The ThunderKittens version is 11 lines of tile operations that a compiler can verify for correctness.

And the performance? On the H100, ThunderKittens' attention forward kernel matches FlashAttention-3 — the state-of-the-art hand-tuned implementation that uses Hopper-specific features like TMA and warp specialization. ThunderKittens uses those same features, but exposes them through clean abstractions instead of raw PTX.

This equivalence is not an accident. FA3 and ThunderKittens use the same algorithmic approach (tiled attention with online softmax), the same hardware features (TMA, warp specialization, register reallocation), and the same memory hierarchy strategy (Q in registers, K/V streaming through SRAM). The only difference is the programming model: FA3 is hundreds of lines of intricate CUDA with hand-managed PTX intrinsics; ThunderKittens is tens of lines of tile operations. Same performance, dramatically less code, dramatically lower bug surface.

In the ThunderKittens attention inner loop, where do the attention scores (att) live during the softmax computation?

Entirely in registers — sub_row, exp, and div_row all operate on register tiles with no memory traffic In shared memory, because softmax needs reduction across the row In L2 cache, because the attention matrix is too large for registers

Chapter 5: Block-Level — LCSF Template

A single warp can only use one tensor core. But an SM has multiple tensor cores, and a thread-block can contain multiple warp groups. To saturate the hardware, we need to coordinate multiple warps working on different parts of the problem while sharing data through shared memory.

ThunderKittens provides a block-level programming template called LCSF: Load, Compute, Store, Finish. You fill in four functions, and ThunderKittens handles all the synchronization, barrier management, and warp specialization.

L — Load

Producer warps load data from HBM to shared memory via TMA. Signal completion with barriers.

↓ barrier

C — Compute

Consumer warp groups run tensor-core operations on shared tiles. Accumulate results in registers.

↓ barrier

S — Store

Store warps write results from shared memory back to HBM via TMA.

↓

F — Finish

Final synchronization, state cleanup, iteration counter update.

Producer-consumer warp specialization

The key idea is warp specialization: different warps in the same thread-block do different jobs. Some warps are "producers" (loading data), others are "consumers" (computing on data). This lets loads and computes overlap in time — while the compute warps are multiplying the current tile, the load warps are already fetching the next tile from HBM.

cuda
// === LOAD function (producer warp) ===
if(warpgroup::warpid() == 0) {
    // Tell TMA how many bytes to expect
    tma::expect(inputs_arrived, block.k, block.v);

    // Fire off async loads from HBM -> shared memory
    tma::load_async(block.k, globals.k,
        {batch, head, iter, 0}, inputs_arrived);
    tma::load_async(block.v, globals.v,
        {batch, head, iter, 0}, inputs_arrived);
}
else arrive(inputs_arrived);  // non-load warps just signal

cuda
// === COMPUTE function (consumer warp groups) ===
warpgroup::mm_ABt(att, scratch.q[state.id], block.k);
warpgroup::mma_async_wait();

// Online softmax
sub_row(att, att, max_vec);
exp(att, att);
div_row(att, att, norm_vec);
copy(att_bf16, att);

// Accumulate O += softmax(QK^T) * V
warpgroup::mma_AB(state.o, att_bf16, block.v);
warpgroup::mma_async_wait();
arrive(inputs_finished);

Register budget management

Different warps need different resources. Load warps barely touch registers — they just issue TMA commands and wait on barriers. Compute warps need as many registers as possible to hold Q tiles, attention tiles, output accumulators, and softmax state. ThunderKittens lets you explicitly reallocate registers between warps:

cuda
// Load warp: give up registers (only need 40)
warpgroup::decrease_registers<40>();

// Compute warp: claim extra registers (up to 232)
warpgroup::increase_registers<232>();

This is a Hopper-specific feature (the setmaxnreg PTX instruction). On A100, you cannot dynamically reallocate registers, so all warps must share the same budget. On H100, ThunderKittens exploits this to give compute warps nearly the full 255-register limit while keeping load warps lean.

The math works out cleanly. A thread-block with 1 load warpgroup (128 threads at 40 regs each = 5,120 regs) and 2 compute warpgroups (256 threads at 232 regs each = 59,392 regs) uses a total of 64,512 registers. The H100 has 65,536 registers per SM (256 KB / 4 bytes), so this configuration uses 98.4% of the register file. Almost nothing is wasted.

Without warp specialization, all 384 threads would share the register budget equally: 65,536 / 384 = 170 registers per thread. The compute warps would have 170 instead of 232 — a 27% reduction in register capacity, meaning smaller tiles, more spills, and worse performance. Warp specialization is not just a convenience; it is a fundamental enabler of high tile occupancy.

Pipelining stages

The LCSF template supports multi-stage pipelining. With N stages, you can have N tiles in flight simultaneously: stage 0 is being computed, stage 1 is being loaded, stage 2's load just finished, etc. The effect on performance is dramatic:

Pipeline Stages	TFLOPS (GEMM, M=N=K=4096)	% of Peak
1 (no pipelining)	260	26%
2	484	49%
3	683	69%
4	760	77%

Here is a timeline view of what happens with 4-stage pipelining. Each row is a pipeline stage, and time flows left to right:

timeline
Time  -->    t0       t1       t2       t3       t4       t5
            -------- -------- -------- -------- -------- --------
Stage 0:    LOAD     COMPUTE  STORE
Stage 1:             LOAD     COMPUTE  STORE
Stage 2:                      LOAD     COMPUTE  STORE
Stage 3:                               LOAD     COMPUTE  STORE

At time t1, stage 0 is computing while stage 1 is loading. At t2, stage 0 is storing, stage 1 is computing, and stage 2 is loading. The tensor cores are busy at every time step after the pipeline fills. Without pipelining (1 stage), the tensor cores sit idle during every load and store phase.

From 26% to 77% — just by pipelining. With a single stage, the tensor cores sit idle while waiting for the next tile to load from HBM. With 4 stages, there are always tiles ready to compute. The cost is shared memory (each stage needs its own buffer), but the H100 has 228 KB per SM — enough for 4 stages of 64×64 BF16 tiles.

Why does ThunderKittens give load warps only 40 registers while giving compute warps 232?

Load warps only issue TMA commands and wait on barriers — they barely use registers. Compute warps need registers to hold Q tiles, attention scores, output accumulators, and softmax state. Redistributing registers maximizes the compute warps' tile capacity. Load warps run on a different SM that has fewer registers The CUDA compiler automatically balances register usage

Chapter 6: Grid-Level Scheduling

The final level of the ThunderKittens hierarchy is the grid — how thread-blocks are scheduled across the 132 SMs of the H100. Two optimizations at this level have outsized impact: persistent grids and block launch order.

Persistent grids

In a normal CUDA launch, the GPU creates one thread-block per output tile. When a block finishes, the hardware scheduler launches a new one. This launch overhead is significant for small tiles — setting up a new block (allocating shared memory, initializing barriers, loading constants) can take thousands of cycles.

A persistent grid launches exactly as many blocks as there are SMs (132 on H100) and keeps them alive for the entire kernel. Each block processes multiple output tiles by iterating through a work queue. No launch overhead, no context switching.

K	TK (no persist)	TK (persist)	CuBLAS
64	93 TFLOPS	108 TFLOPS	69 TFLOPS
128	161	184	133
256	271	309	242
512	414	450	407

For skinny matrices (small K), persistent grids give a 16% speedup. Even for larger K, the persistent version consistently outperforms. And both TK configurations beat CuBLAS at small K — a 40-line kernel outperforming NVIDIA's flagship library.

Why does CuBLAS struggle at small K? At K=64 with M=N=4096, the problem is tiny: 2 × 4096 × 64 × 4096 = 2.1 billion FLOPs. On a 989 TFLOPS GPU, that should take about 2 microseconds. But launching 4096/128 × 4096/128 = 1,024 thread-blocks takes overhead that is a significant fraction of 2 microseconds. CuBLAS launches fresh blocks; ThunderKittens' persistent grid already has 132 blocks running — they just pick up the next work item from a queue. Zero launch overhead.

This matters in practice because many ML workloads involve sequences of small GEMMs (e.g., MLP layers with small batch sizes during inference, or the K and V projections in attention with small sequence lengths). CuBLAS is tuned for large square matrices; ThunderKittens performs well across the entire size spectrum.

Block launch order and L2 reuse

The order in which blocks process output tiles determines how well the L2 cache is utilized. Consider a GEMM C = A × B where C is tiled into a grid. If blocks iterate in row-major order {N, M}, each new block needs a completely different column of B. But if blocks iterate in a swizzled order {8, N, M/8}, neighboring blocks share columns of B, which stay hot in L2.

Kernel	Block Order	HBM BW	TFLOPS
GEMM 16K³	{8, N, M/8} (swizzled)	982 GB/s	805
GEMM 16K³	{N, M} (naive)	3,070 GB/s	392
Attention	{N, H, B} (optimal)	213 GB/s	600
Attention	{B, H, N} (naive)	2,390 GB/s	494

3× less HBM traffic, 2× more TFLOPS. The swizzled GEMM order reduces HBM bandwidth from 3,070 GB/s to 982 GB/s (3.1× less), and TFLOPS jumps from 392 to 805 (2.1× more). The same data is read from HBM far fewer times because the L2 cache keeps recently-used tiles hot. For attention, the right block order ({N, H, B} instead of {B, H, N}) reduces HBM traffic by 11× and improves throughput by 21%.

Putting it all together: three design principles

ThunderKittens is organized around three principles, one for each level:

Principle 1: Tile Data Structures

16×16 tiles with managed layouts as the basic data structure. PyTorch-like operations. Compile-time layout checking. Warp level.

↓

Principle 2: LCSF Template

Producer-consumer warp specialization with pipelined async loads. Developer fills in 4 functions. Block level.

↓

Principle 3: Grid Scheduling

Persistent grids with configurable block launch order for L2 reuse. Grid level.

Each principle addresses a different bottleneck: register utilization (warp), latency hiding (block), and cache reuse (grid). Together, they let a 40-line kernel achieve 80%+ of theoretical peak.

A concrete example: block order for attention

For attention with batch size B, H heads, and sequence length N, each output tile is indexed by (b, h, n). The block launch order determines which tiles are processed by adjacent blocks on the same SM cluster — and therefore which input tiles remain in L2.

With order {B, H, N}: block 0 processes (0, 0, 0), block 1 processes (0, 0, 1), etc. Consecutive blocks along N share the same Q, K, V tensors for the same (b, h) pair. The K and V tiles from the current sequence position stay hot in L2 for the next block that processes the next query position of the same head.

With order {N, H, B}: block 0 processes (0, 0, 0), block 1 processes (0, 1, 0) — a completely different head. The K, V tiles from head 0 are evicted before they can be reused. Every block reads fresh data from HBM.

The difference: 213 GB/s HBM traffic (good order) vs 2,390 GB/s (bad order). That is 11× more HBM bandwidth wasted on re-reading data that could have stayed in L2. And the TFLOPS difference (600 vs 494) directly reflects this: more time waiting for HBM = less time computing.

Block order is free performance. Changing a single template parameter ({N, H, B} vs {B, H, N}) produces a 21% throughput improvement with zero changes to the kernel code. This is pure scheduling — the same operations execute in the same way, but the cache behavior changes dramatically. ThunderKittens makes this a first-class parameter precisely because its impact is so large.

Why does block launch order matter for GEMM performance?

Different block orders change the number of thread-blocks needed When neighboring blocks share input tiles (e.g., columns of B), those tiles stay hot in L2 cache, reducing HBM reads. A swizzled order ensures nearby blocks access overlapping data, cutting HBM traffic by 3× and doubling throughput. The GPU scheduler prefers blocks launched in ascending order

Chapter 7: The Attention Kernel

Now let's see how all three levels — warp, block, grid — combine in a complete attention kernel. This is the showpiece: a kernel that matches FlashAttention-3 on forward pass and outperforms all baselines by 10–40% on the backward pass.

Forward pass structure

The attention forward pass computes O = softmax(QK^T / √d) · V for sequences of length N with head dimension d. ThunderKittens tiles the computation:

Grid level

Persistent grid of 132 blocks. Block order: {N, H, B} for L2 reuse of K, V across sequence positions.

↓

Block level (LCSF)

1 load warpgroup (40 regs) + 2 compute warpgroups (232 regs each). 4-stage pipeline for K, V tiles.

↓

Warp level

Each compute warp holds q_reg (16×64), att (16×16), o_reg (16×64). Inner loop: load K → QK^T → softmax → load V → att·V → accumulate O.

The load warpgroup fires off TMA loads for the next K and V tiles while the compute warpgroups are still working on the current tiles. By the time the compute finishes, the next tiles are already in shared memory. This is the pipelining from Chapter 5 in action.

Online softmax

The softmax in attention must be computed over the entire sequence length, but we only see one K tile at a time. ThunderKittens uses the online softmax trick (the same one FlashAttention uses): maintain a running maximum and running sum, and correct the output accumulator whenever a new maximum is found.

cuda
// For each K tile in the sequence:
mma_ABt(att, q_reg, k_reg, att);   // QK^T for this tile

// Update running max
row_max(new_max, att);
max(new_max, new_max, old_max);

// Correction factor for previous accumulator
sub_row(correction, old_max, new_max);
exp(correction, correction);
mul_row(o_reg, o_reg, correction);  // rescale previous output

// Softmax current tile
sub_row(att, att, new_max);
exp(att, att);
row_sum(tile_sum, att);

// Accumulate
mma_AB(o_reg, att_bf16, v_col, o_reg);

// Update running state
copy(old_max, new_max);
add(running_sum, running_sum, tile_sum);

Backward pass: where ThunderKittens shines

The attention backward pass is harder to optimize than the forward pass because it requires recomputing the attention matrix and has more complex data dependencies. Here is where ThunderKittens' advantage is most visible:

Metric	FlashAttention-3	ThunderKittens	Advantage
Tensor core utilization	61.2%	58.2%	FA3 slightly higher
HBM bandwidth used	328 GB/s	490 GB/s	TK 1.5× more efficient
Shared memory stall cycles	1.83	0.14	TK 13× fewer stalls
Overall throughput	Baseline	+10–40%	TK faster end-to-end

13× fewer shared memory stalls. This is the payoff of swizzled layouts (Chapter 1) and register-level programming. FA3 achieves slightly higher tensor core utilization but wastes that advantage waiting for shared memory bank conflicts to resolve. ThunderKittens' zero-conflict layouts mean the tensor cores are almost never starved for data. The result: higher HBM utilization (the data pipeline runs smoothly) and 10–40% better end-to-end performance despite slightly lower tensor core occupancy.

GEMM: the foundation benchmark

Before attention, let's look at the simpler case: general matrix multiply (GEMM). ThunderKittens' GEMM kernel is just 40 lines of code using the LCSF template. Here is how it performs against CuBLAS (NVIDIA's flagship library, 689 MB of hand-tuned code):

Matrix Size (M=N=K)	TK (persistent)	CuBLAS	TK vs CuBLAS
4096	767 TFLOPS	735 TFLOPS	+4%
8192	~790 TFLOPS	~780 TFLOPS	~equal
16384	805 TFLOPS	~800 TFLOPS	~equal

A 40-line kernel matching a 689 MB library. At 16K×16K×16K, ThunderKittens achieves 805 TFLOPS — 81% of the H100's 989 TFLOPS theoretical peak. The remaining 19% is spent on shared memory loads, register-to-register copies, and softmax-like reductions. This is near the hardware limit.

For skinny matrices (small K), ThunderKittens actually beats CuBLAS by significant margins (up to 57% at K=64) because the persistent grid avoids block launch overhead that dominates at small problem sizes.

Causal attention

For autoregressive models, attention is causal — token i can only attend to tokens j ≤ i. ThunderKittens handles this through the tile scheduling: tiles where all keys come after the query position are simply skipped (zero attention). Tiles on the diagonal (partially causal) apply a mask within the tile. The tiling structure means most of the computation is either fully causal (skip) or fully non-causal (no mask needed), minimizing wasted work.

To be concrete: for a 4096-token sequence with 64×64 tiles, the attention matrix is 64×64 = 4,096 tiles. Of these, about half are below the diagonal (fully non-causal: compute normally), about half are above (fully causal: skip entirely), and 64 are on the diagonal (partially masked: apply the causal mask within the tile). Only 64 out of 4,096 tiles need masking logic — less than 2% of the total work. The overhead of causal masking is negligible when the tile size is chosen well.

ThunderKittens' tile abstraction makes this optimization automatic. The grid scheduler skips tiles above the diagonal, and the warp-level code applies the mask only for diagonal tiles. In raw CUDA, implementing this skip-logic efficiently requires careful bookkeeping; in ThunderKittens, it is a coordinate comparison in the grid loop.

Attention Kernel Data Flow

Shows how data flows through the three ThunderKittens levels during one iteration of the attention forward pass. Producer warps (teal) load the next K,V tiles while consumer warps (warm) compute on the current tiles.

Click to see pipelined execution

Why does ThunderKittens outperform FlashAttention-3 on the backward pass despite slightly lower tensor core utilization?

Because ThunderKittens has 13× fewer shared memory stall cycles (0.14 vs 1.83) due to swizzled bank-conflict-free layouts. FA3's tensor cores stall waiting for data, wasting their higher utilization rate. TK's smooth data pipeline achieves better end-to-end throughput. Because ThunderKittens uses a better attention algorithm Because ThunderKittens runs on newer hardware

Chapter 8: Emerging Architectures

Matching FlashAttention-3 on standard attention is impressive, but the real value of ThunderKittens is on emerging architectures — models that don't have hand-tuned CUDA kernels. When a new architecture is published, the first implementation is usually in PyTorch (slow), then maybe Triton (better), and eventually someone spends months writing a CUDA kernel (fast). ThunderKittens compresses that last step from months to days.

Linear attention: 6.5–14× speedup

Linear attention replaces the softmax(QK^T) with a kernel function φ(Q)φ(K)^T that can be computed in O(N) instead of O(N²). The trick is that the multiplication order changes: instead of computing the N×N attention matrix, you compute φ(K)^TV first (a d×d matrix), then multiply by φ(Q).

Standard: O = softmax(QK^T) · V O(N²d)
Linear: O = φ(Q) · (φ(K)^T · V) O(Nd²)

The catch is that the d×d intermediate state must be maintained across the sequence as a running sum — this is the "state" in "state-space model." Maintaining this state efficiently requires careful register management, which is exactly what ThunderKittens provides.

Variant	Baseline	ThunderKittens	Speedup
Polynomial feature map	Flash Linear Attention	TK kernel	14×
Learned feature map	Flash Linear Attention	TK kernel	6.5×

14× is not a typo. The Flash Linear Attention baseline is already an optimized kernel, not naive PyTorch. The speedup comes from ThunderKittens' ability to keep the d×d state in registers and pipeline the sequential updates through the LCSF template.

Why is the state-maintenance so hard to optimize? In linear attention, you maintain a running state matrix S (d×d) that is updated at every time step: S_t = S_t-1 + φ(k_t) · v_t^T. Then the output is o_t = φ(q_t) · S_t. For d=64, S is 64×64 = 4,096 elements in FP32 = 16 KB. That is too large for one warp's registers (which hold at most ~8 KB of tile data), so you need warpgroup-level collective ownership.

In Triton, you cannot express this: the compiler decides where S lives, and it typically puts it in shared memory, adding 30-cycle latency to every state access. In ThunderKittens, you explicitly declare S as a register tile distributed across a warpgroup, and every update and query is a register-to-register operation at 1-cycle latency. This 30× latency reduction on the most frequently accessed data structure explains most of the 14× speedup.

State-space models: Mamba-2

Mamba-2 is a state-space model that maintains a hidden state and updates it at each time step. The update involves matrix multiplies on the state, which maps naturally to ThunderKittens tiles:

Kernel	Baseline	ThunderKittens	Speedup
Mamba-2 forward	Triton kernel	TK kernel	>3×

The Triton baseline is already GPU-optimized. ThunderKittens' advantage comes from register-level control that Triton's compiler cannot achieve — Triton operates at the block level and cannot control which data lives in registers vs shared memory.

Mamba-2's selective scan involves a recurrence: h_t = A_t · h_t-1 + B_t · x_t, where A_t is a diagonal state transition matrix and h_t is the hidden state. The diagonal structure means A_t · h_t-1 is an element-wise multiply (a pointwise operation in ThunderKittens), while B_t · x_t is an outer product (a single MMA). The sequential nature of the recurrence means the state must persist in registers across time steps — any spill to SRAM or HBM adds latency to every step in the chain.

ThunderKittens keeps the entire state in register tiles, uses the LCSF template to pipeline the input loads, and processes chunks of time steps in parallel where the recurrence allows. The result: >3× faster than the best Triton implementation, with the same numerical results.

Long convolution via FFT

Some architectures (Hyena, S4) use long convolutions computed via FFT. The FFT involves complex multiplications on tiles of data that fit naturally into ThunderKittens' 16×16 abstraction:

Sequence Length	FlashFFTConv	ThunderKittens	Speedup vs FlashFFTConv	Speedup vs PyTorch FFT
1024	Baseline	TK kernel	7.9×	—
4096	Baseline	TK kernel	4.7×	8.7×

The profiling data tells the story of why ThunderKittens is faster:

Metric	FlashFFTConv	ThunderKittens
Tensor core utilization	13.4%	54.8%
HBM bandwidth	14.8 GB/s	31.4 GB/s

From 13% to 55% tensor core utilization. FlashFFTConv uses the tensor cores for barely a tenth of the time. ThunderKittens, by keeping data in register tiles and pipelining loads, keeps the tensor cores busy more than half the time. This 4× improvement in utilization translates directly to the 4.7–7.9× throughput gains.

The FFT kernel is particularly interesting because FFT is not traditionally thought of as a tensor-core workload. The key insight is that the FFT butterfly operations can be expressed as matrix multiplies on small tiles — the Monarch matrix decomposition factors the FFT into a sequence of block-diagonal matrix multiplies, each of which maps perfectly to ThunderKittens' 16×16 tiles. The TK FFT kernel uses 64×64 tiles with 4 pipeline stages (INPUT=4, OUTPUT=2), with multiple consumer warpgroups processing different frequency blocks in parallel.

This is a general lesson: many algorithms that seem inherently sequential or non-matmul can be reformulated as tile-based linear algebra. Once reformulated, ThunderKittens gives them near-peak performance almost for free. The library's value is not just in the kernels it ships, but in the thinking framework it promotes: decompose your algorithm into 16×16 tile operations, and the hardware will do the rest.

Why are these speedups so large?

The speedups on emerging architectures (3–14×) are much larger than on standard attention (~1×). This is not because the architectures are easier to optimize — it is because the baselines are weaker. Standard attention has FlashAttention-2 and FA3: kernels that took years of expert CUDA engineering. Emerging architectures have Triton kernels or PyTorch implementations that no one has spent months hand-tuning.

The Triton compiler operates at the block level — it sees tiles of data but cannot control which tiles live in registers vs shared memory. For standard attention, Triton's block-level view happens to produce good code because the access pattern is regular. For state-space models, where the computation involves sequential state updates with complex data dependencies, Triton's compiler makes suboptimal choices about register allocation and memory placement.

ThunderKittens closes this gap by giving the programmer explicit register control with a clean API. The "months of CUDA expertise" is baked into the library's abstractions (swizzled layouts, LCSF pipelining, persistent grids), so the kernel author only needs to express the algorithm's tile-level logic.

This has profound implications for ML research velocity. Today, proposing a new attention variant (linear attention, sliding window, multi-query, grouped-query) means either accepting PyTorch-speed benchmarks that understate the method's potential, or waiting months for someone to write a CUDA kernel. With ThunderKittens, the researcher can write a competitive kernel in days and benchmark against the state of the art on equal footing. The library removes the "kernel engineering" tax from architecture research.

Consider the timeline for Based (linear attention with polynomial feature maps). The paper was published, a PyTorch implementation was available, a Triton kernel followed, but the ThunderKittens kernel showed that all previous benchmarks had underestimated the method's throughput by 14×. The algorithm was always fast — the implementations just couldn't show it. How many good ideas have been rejected because their benchmark implementations were slow? ThunderKittens makes this a question we can stop asking.

The pattern

Across all these emerging architectures, the story is the same: existing implementations leave 50–90% of the hardware idle because they cannot control register placement. ThunderKittens fills this gap. The library's abstractions — tiles, LCSF, persistent grids — are general enough that any tile-based computation can be expressed in tens of lines and achieve 50–80% of theoretical peak.

And the development cost? The paper reports that these kernels were written by undergraduates with no prior CUDA experience in weeks, not months. That is the true contribution of ThunderKittens: democratizing high-performance GPU programming.

Why are ThunderKittens' speedups largest on emerging architectures (6–14×) compared to standard attention (~1×)?

Because standard attention has a simpler computation pattern Because emerging architectures use different data types Because standard attention already has expert-tuned CUDA kernels (FA2, FA3) that exploit the same hardware tricks, while emerging architectures only have Triton/PyTorch baselines that cannot control register placement. ThunderKittens closes the gap between new algorithms and expert-level hardware utilization.

Chapter 9: Connections

What ThunderKittens changes

ThunderKittens is a systems paper, but its implications extend far beyond GPU programming:

Domain	Impact
ML research	New architectures no longer need to wait months for efficient kernels. Linear attention, SSMs, and hybrid models can be prototyped and benchmarked at near-peak performance in days.
Compiler research	ThunderKittens shows that the right abstraction level matters more than compiler sophistication. Triton's block-level compiler cannot match hand-tuned register placement; TK's tile-level abstraction can.
Hardware design	The paper provides a template for what future GPU programming models should look like: expose hardware primitives (tensor cores, TMA, register reallocation) through clean abstractions, not through raw PTX.
Education	Undergraduates writing competitive CUDA kernels. This democratizes a skill that previously required years of systems experience.

Key numbers to remember

Fact	Number
H100 BF16 tensor core peak	989 TFLOPS
H100 HBM bandwidth	3 TB/s
Tensor core FLOPs vs general FP32	16×
Register limit per thread (H100)	255
GEMM TFLOPS (TK, persistent, 16K³)	805 (81% peak)
Attention backward speedup vs FA3	10–40%
Linear attention speedup (polynomial)	14×
FFT convolution speedup	4.7–7.9×
Mamba-2 speedup vs Triton	>3×
ThunderKittens library size	<1 MB

Limitations and future work

ThunderKittens is not a silver bullet. Its current limitations are worth understanding:

Limitation	Why
NVIDIA-only	Tightly coupled to CUDA, tensor cores, and TMA. No AMD ROCm or Intel support.
Hopper-optimized	Best features (TMA, setmaxnreg, warpgroup MMA) require H100. A100 support exists but with fewer optimizations.
BF16/FP32 focus	FP8 and integer quantization support is limited. As quantized inference becomes dominant, this matters.
Tile-shaped problems only	Irregular computations (sparse attention, graph neural networks) may not decompose cleanly into 16×16 tiles.
No autotuning	Pipeline stages, warpgroup counts, and block order are manual choices. An autotuner could explore this space automatically.

Despite these limitations, the design philosophy — expose hardware primitives through clean, composable abstractions at the right granularity — is broadly applicable. Future GPU architectures (NVIDIA Blackwell, AMD CDNA4) will have their own tile-level primitives, and the ThunderKittens approach of wrapping them in a readable DSL is likely to persist even as the specific hardware changes.

Related papers

FlashAttention (Dao, 2022)

The first IO-aware attention algorithm. Pioneered tiling attention for GPU memory hierarchy. ThunderKittens generalizes this approach to arbitrary kernels.

↓

FlashAttention-2/3 (Dao, 2023/2024)

Hand-tuned CUDA kernels exploiting warp specialization and Hopper features. ThunderKittens matches FA3 forward, beats backward by 10–40%.

↓

Triton (Tillet et al., 2019)

Block-level GPU programming language with compiler. Easier than CUDA but cannot control register placement. TK fills the gap between Triton and raw CUDA.

↓

CUTLASS / CuTe (NVIDIA)

C++ template library for tensor-core programming. Powerful but complex (22 MB of templates). ThunderKittens achieves similar performance in <1 MB with a cleaner API.

↓

Mamba / Mamba-2 (Gu & Dao, 2023/2024)

State-space models with selective scan. ThunderKittens provides >3× speedup on Mamba-2 kernels over Triton implementations.

↓

Based (Arora et al., 2024)

Linear attention with learned and polynomial feature maps. ThunderKittens provides 6.5–14× speedup on Based kernels.

The abstraction ladder

ThunderKittens sits at a very specific point on the abstraction-performance tradeoff curve:

Tool	Abstraction Level	Controls Registers?	Controls SRAM Layout?	Controls Grid?	Typical Lines
PyTorch	Tensor	No	No	No	5–20
Triton	Block	No	Partial	No	30–100
ThunderKittens	Tile	Yes	Yes	Yes	30–80
CUTLASS	Template	Yes	Yes	Yes	200–1000
Raw CUDA/PTX	Thread	Yes	Yes	Yes	500–5000

ThunderKittens gives you the same control as CUTLASS and raw CUDA — registers, shared memory layouts, grid scheduling — but in 30–80 lines instead of hundreds or thousands. The insight is that most of the complexity in CUTLASS and CUDA comes from managing thread-level details (which thread owns which element, how to synchronize warps, how to avoid bank conflicts). By lifting the abstraction to 16×16 tiles, all of that complexity is handled by the library.

The big picture. ThunderKittens belongs to a growing movement: making GPU hardware accessible without sacrificing performance. FlashAttention showed that hardware-aware algorithms beat naive implementations by orders of magnitude. ThunderKittens shows that you don't need to be a CUDA wizard to write hardware-aware code — you just need the right abstraction.

What is ThunderKittens' core contribution compared to existing GPU programming tools?

It provides a faster GPU compiler than Triton It introduces the 16×16 register tile as a first-class programming primitive — a Goldilocks abstraction between raw CUDA (too low) and Triton blocks (too high) that lets non-experts write kernels matching hand-tuned implementations in tens of lines instead of hundreds. It uses a new hardware instruction set not available to other libraries