GPU Kernel Landscape — From Absolute Zero to Mastery

Chapter 0: The Gap

You call torch.matmul(A, B) and it takes 4 milliseconds. An expert writes a custom CUDA kernel for the same operation and it takes 0.3 milliseconds. That's a 13x difference — same GPU, same data, same math.

Where does the gap come from? PyTorch launches a generic kernel that works for any shape, any dtype, any device. It can't assume your matrices are square. It can't fuse the ReLU you're about to apply next. It doesn't know you'll call this 10,000 times with the same shape. The generic kernel pays for flexibility with performance.

A custom kernel knows exactly what it's doing. It tiles the matrices to fit in fast on-chip memory. It fuses multiple operations into one pass. It uses Tensor Cores — special hardware units that do 16x16 matrix multiplies in a single clock cycle. It overlaps memory loads with computation so the GPU never stalls waiting for data.

The core tension: GPUs have extraordinary theoretical throughput — an H100 can do 990 TFLOPS of FP16 math. But reaching even 60% of that peak requires hand-crafting how data flows through registers, shared memory, and global memory. The gap between "works" and "fast" is the entire subject of this lesson.

This gap spawned an entire ecosystem of tools. CUDA (2007) gave you raw control but required expert-level C++. Triton (2021) let you write Python and get good kernels. CUTLASS gave template libraries for structured patterns. And in 2025, a dozen "Tile DSLs" appeared, each trying to find the sweet spot between productivity and performance.

By the end of this lesson, you'll understand why each exists, what problem it solves, and which to reach for in 2026.

The Performance Gap

Click each implementation to see how close it gets to hardware peak. The gap is what custom kernels close.

Why is there a performance gap between PyTorch's built-in matmul and a hand-written CUDA kernel?

PyTorch uses CPUs for matrix math Generic kernels can't exploit shape-specific tiling, fusion, or Tensor Cores as aggressively Python is inherently slower than C++

Chapter 1: Memory Hierarchy — Where Speed Lives

Every GPU performance problem is a memory problem. The arithmetic units on an H100 can multiply matrices at 990 trillion operations per second. But the memory that feeds them can only deliver 3.35 TB/s. If you do the math: to keep the ALUs busy, every byte loaded from memory must fuel ~296 FP16 operations. If your operation does fewer ops per byte (low arithmetic intensity), the GPU starves.

The solution is a memory hierarchy — multiple levels of storage, each faster and smaller than the last. Data starts in slow, massive HBM and gets staged through progressively faster caches until it reaches the registers where math actually happens.

Level	Size (H100)	Bandwidth	Latency
Registers	256 KB per SM	~20 TB/s	~1 cycle
Shared Memory	228 KB per SM	~15 TB/s	~20 cycles
L2 Cache	50 MB total	~6 TB/s	~200 cycles
HBM (Global)	80 GB total	3.35 TB/s	~400 cycles

Registers are the fastest. Each thread has its own private registers — accessing them costs just one clock cycle. But you only get a few hundred per thread, and they're invisible to other threads.

Shared memory (SMEM) is the key to fast kernels. It sits on-chip, shared among all threads in a thread block. It's 6x faster than HBM but only 228 KB per streaming multiprocessor. The entire art of GPU kernel writing is about loading a tile of data from HBM into shared memory, doing as much computation as possible on that tile, then writing the result back.

L2 cache is shared across the entire GPU. It helps when multiple thread blocks access nearby data, but you don't control what stays in L2 — the hardware does.

HBM (High Bandwidth Memory) is the "main memory" of the GPU. 80 GB is a lot, but at 400+ cycle latency, every trip to HBM is expensive. The golden rule: minimize HBM round-trips.

Arithmetic Intensity: the ratio of compute operations to bytes moved. A matmul of two (N x N) matrices does 2N³ FLOPs but loads 2N² elements — intensity grows with N. That's why matmul is compute-bound for large N. Pointwise ops like ReLU do 1 FLOP per element loaded — always memory-bound. The whole point of kernel fusion is to do memory-bound ops while data is already in registers from a compute-bound op.

GPU Memory Hierarchy

Click a level to see its size, speed, and role. Watch data flow from HBM up to registers.

Why is shared memory critical for fast GPU kernels?

It's the largest memory on the GPU It lets threads communicate without going through the CPU It's on-chip, fast, and shared among threads in a block — enabling data reuse within a tile

Chapter 2: CUDA — Grids, Blocks, Warps, Threads

CUDA is NVIDIA's programming model for GPUs. It's been around since 2007 and everything else in this lesson is built on top of it. To understand any kernel framework, you need to understand CUDA's execution model.

When you launch a CUDA kernel, you specify a grid of thread blocks. Each thread block runs on one Streaming Multiprocessor (SM). An H100 has 132 SMs. Each block contains up to 1024 threads, and those threads execute in groups of 32 called warps.

Grid

The entire computation. You choose its shape: e.g. 128 × 128 blocks.

↓ contains

Thread Block

Up to 1024 threads. Runs on one SM. Shares SMEM. You choose its shape: e.g. 256 threads.

↓ divided into

Warp (32 threads)

The actual unit of execution. All 32 threads execute the SAME instruction in lockstep (SIMT).

↓ each thread has

Thread

Own registers + thread ID. Computes one element or a small tile.

Here's the simplest possible kernel — vector addition:

cuda
__global__ void vecAdd(float* A, float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
// Launch: vecAdd<<<(N+255)/256, 256>>>(A, B, C, N);
// Grid has ceil(N/256) blocks, each block has 256 threads.
// Thread i loads A[i] and B[i] from HBM, adds them, writes C[i].

Each thread computes one element. blockIdx.x tells you which block you're in, threadIdx.x tells you which thread within that block. Together they give a global index.

But this kernel is memory-bound — it does 1 FLOP per 12 bytes loaded (two reads + one write, each 4 bytes). For a matmul, we need tiling: each block loads a tile of A and B into shared memory, then all threads in the block reuse that tile for many multiply-accumulate operations.

Why warps matter: When threads in a warp access consecutive memory addresses, the hardware coalesces those accesses into one wide transaction. Thread 0 reads address 0, thread 1 reads address 4, thread 2 reads address 8... and the GPU fetches all 128 bytes in one shot. If threads access random addresses, each gets its own slow transaction. Coalescing is free performance — just organize your data right.

The CUDA programming model gives you total control: you decide what each thread does, how shared memory is used, how warps synchronize. This power is why experts can write kernels that hit 80%+ of peak throughput. But it's also why CUDA code for a matrix multiply is 200+ lines of intricate C++ with manual index arithmetic, explicit memory barriers, and architecture-specific tuning.

What is a "warp" in CUDA?

A group of 32 threads that execute the same instruction in lockstep A collection of thread blocks that share L2 cache The memory bus between shared memory and registers

Chapter 3: Tensor Cores — Hardware Matrix Engines

In 2017, NVIDIA added a new kind of unit to their GPUs: the Tensor Core. Regular CUDA cores do one multiply-add per cycle per thread. A Tensor Core does an entire small matrix multiply — D = A * B + C where A, B, C, D are small matrices — in a single operation across a warp.

Generation	GPU	WMMA shape (FP16)	FP16 TFLOPS
Volta (2017)	V100	16 × 16 × 16	125
Ampere (2020)	A100	16 × 8 × 16	312
Hopper (2022)	H100	64 × 256 × 16 (WGMMA)	990
Blackwell (2024)	B200	expanded WGMMA	2250

On Volta/Ampere, you used WMMA (Warp Matrix Multiply-Accumulate): each warp cooperatively loads fragments of A and B, calls wmma::mma_sync, and gets back a fragment of D. The programmer manages loading A/B tiles from shared memory into register fragments.

Hopper changed everything with WGMMA (Warp Group MMA). Instead of one warp (32 threads), WGMMA uses a warp group (128 threads = 4 warps). The tile sizes exploded: 64 × 256 × 16 in one instruction. And WGMMA can read B directly from shared memory — you don't need to load it into registers first.

The key insight: Tensor Cores are the reason custom kernels matter. If your kernel doesn't use Tensor Cores, you're leaving 10× performance on the table. But using them requires expressing your computation as small matrix multiplies with specific shapes and alignments. Every framework in this lesson is fundamentally about making Tensor Core usage easier.

Here's what WMMA looks like in CUDA:

cuda
// Declare 16x16x16 fragments
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

// Load tiles from shared memory into register fragments
wmma::load_matrix_sync(a_frag, smem_a + offset_a, 16);
wmma::load_matrix_sync(b_frag, smem_b + offset_b, 16);
wmma::fill_fragment(c_frag, 0.0f);

// One warp does 16x16x16 = 8192 FMAs in one instruction
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

That's 8192 multiply-accumulate operations in one instruction. Without Tensor Cores, those same 32 threads would need 256 cycles to do the same work. This is why Tensor Core utilization is the single most important metric for kernel performance on modern GPUs.

FP8 on Hopper/Blackwell: Each new generation adds narrower data types. Hopper added FP8 (8-bit floating point) — halving memory bandwidth requirements vs FP16 while doubling Tensor Core throughput. Blackwell adds NVFP4 and MXFP8 (microscaling formats). Narrower types mean more ops per byte — the arithmetic intensity goes up, and memory bandwidth matters less.

What is the fundamental operation a Tensor Core performs?

Element-wise multiply of two vectors A small matrix multiply-accumulate (D = A*B + C) across a warp A full matrix transpose in shared memory

Chapter 4: Triton — The Compiler Approach

Writing a fast CUDA matmul is 200+ lines of index math, shared memory management, and architecture-specific intrinsics. Philippe Tillet looked at this in 2019 and asked: what if a compiler handled the hard parts?

Triton lets you write GPU kernels in Python. You think in blocks (tiles), not individual threads. The compiler decides how to map your blocks to warps, how to stage data through shared memory, and when to use Tensor Cores.

python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(A, B, C, M, N, K, stride_am, stride_ak,
                   stride_bk, stride_bn, stride_cm, stride_cn,
                   BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                   BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Pointers to the tiles of A and B
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = A + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = B + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k)
        acc += tl.dot(a, b)  # compiler maps this to Tensor Cores
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = C + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))

Notice what's missing: no explicit shared memory allocation, no thread indexing, no __syncthreads(), no WMMA intrinsics. You write tl.dot(a, b) and the compiler figures out whether to use Tensor Cores, how to tile the dot product, and how to pipeline the loads.

Triton's deal: You give up ~10-20% of peak performance compared to expert CUDA, but you get 5× faster development time and code that's readable by anyone who knows Python. For most ML researchers, this is the right tradeoff. From 2021-2023, Triton became the default for productivity-focused kernel work.

How does the compiler work? Triton operates on blocked programs. Each program_id is one tile of the output. The compiler:

1. Parse Python

Convert decorated function to Triton IR (SSA form)

↓

2. Block-level optimization

Fuse ops, infer shared memory layout, insert barriers

↓

3. Lower to LLVM IR

Map blocks to warps/threads, insert WMMA/WGMMA intrinsics

↓

4. PTX → cubin

NVIDIA's assembler produces the final GPU binary

The limitation: Triton's compiler makes assumptions. When those assumptions match your workload (standard matmul, attention, pointwise ops), it works beautifully. When they don't (persistent kernels, warp specialization, async pipelines), the compiler can't express what you need. This is what hit the wall on Hopper.

What abstraction does Triton use instead of individual threads?

Blocks (tiles) — you think in 2D chunks, and the compiler maps them to threads Warps — you write one function per warp Streams — you describe data flow graphs

Chapter 5: CUTLASS & CuTe — Tiles as First-Class Objects

While Triton hides the hardware behind a compiler, NVIDIA's CUTLASS (CUDA Templates for Linear Algebra Subroutines) takes the opposite approach: give the programmer composable building blocks that directly express hardware operations.

CUTLASS started in 2017 as C++ templates for GEMM (General Matrix Multiply). It was powerful but verbose — a single GEMM configuration involved a dozen template parameters. In 2023, NVIDIA introduced CuTe (CUDA Tensors), a library inside CUTLASS that changed everything.

CuTe's big idea: A Layout is a first-class object that maps logical coordinates (row, column) to physical memory addresses. By separating what you're computing (the tile shape) from where data lives (the layout), you can write kernels that are both generic AND efficient. Changing from row-major to column-major is changing one template parameter, not rewriting the kernel.

In CuTe, a Tensor = data pointer + Layout. A Layout = Shape + Stride. This is the entire abstraction:

c++
using namespace cute;

// A 128x64 tile of FP16, row-major (stride = 64 between rows)
auto layout_A = make_layout(make_shape(128, 64),
                            make_stride(64, 1));

// Wrap a pointer with the layout to get a CuTe Tensor
auto tA = make_tensor(make_gmem_ptr(ptr_A), layout_A);

// Partition this tile across threads for WGMMA
auto thr_A = local_partition(tA, threadIdx.x, Shape<64,16>{});

// Copy from global memory to shared memory
copy(tA, smem_A);  // CuTe handles coalescing, bank conflicts, etc.

Why CUTLASS/CuTe for Hopper? Hopper introduced hardware features that Triton v1 couldn't express:

Feature	What It Does	Why Triton v1 Couldn't
WGMMA	4-warp MMA, reads B from SMEM	Triton assumed 1-warp MMA
TMA	Hardware DMA engine for tiles	No compiler support in 2023
Warp Specialization	Different warps do different jobs	Triton's model: all warps do same thing
Async Pipeline	Overlap compute with next load	Requires manual barrier management

This is why FlashAttention-3 (2024) was written in CUTLASS/CuTe, not Triton. The attention kernel needed all four features simultaneously to hit peak Hopper performance.

The tradeoff: CuTe gives you peak performance but requires understanding C++ templates, NVIDIA hardware details, and the CuTe abstraction itself. Development time: weeks, not days. The 2025 Tile DSL explosion is about bridging this gap — giving CuTe-level performance with Triton-level productivity.

What is the core abstraction in CuTe?

A thread function that maps thread IDs to memory addresses A Layout (Shape + Stride) that separates logical tile coordinates from physical memory A graph of operations that the compiler optimizes

Chapter 6: FlashAttention — Four Generations, Four Eras

No single kernel tells the story of GPU programming better than FlashAttention. Each version pushed the boundaries of what was possible, and each required different tools. It's our case study through the entire timeline.

The problem: Standard attention computes Q·K^T (an N×N matrix), applies softmax, then multiplies by V. For N=8192, that intermediate matrix is 8192×8192 = 256 MB in FP16. It has to be written to HBM and read back — two round-trips through slow memory for a matrix you immediately throw away.

FlashAttention's insight: Never materialize the N×N attention matrix. Compute attention in tiles: load a block of Q, stream through all blocks of K and V, accumulate the softmax-weighted output in registers. This turns attention from O(N²) memory to O(N) memory, and from memory-bound to compute-bound.

Version	Year	Written In	Key Techniques	H100 Speed
FA-1	2022	Raw CUDA	Tiled softmax, online softmax trick	~300 TFLOPS
FA-2	2023	CUDA + Triton	Parallelized over seq length, reduced SMEM	~450 TFLOPS
FA-3	2024	CUTLASS/CuTe	WGMMA, TMA, warp specialization, FP8, async pipeline	~740 TFLOPS
FA-4	2026	CuTe DSL	Blackwell WGMMA, NVFP4, CUDA Tile/TileIR	~1500 TFLOPS

FA-1 (2022): Tri Dao's original paper. Written in raw CUDA with manual SMEM management. The breakthrough was the online softmax trick — you can compute softmax incrementally, tile by tile, without ever seeing the full row. This is the algorithm that made everything else possible.

The online softmax works like this: as you process each K-tile, you track the running maximum m and running sum l. When a new tile has a larger maximum, you rescale the previous accumulator. At the end, you have the exact same result as full-matrix softmax, but computed in O(1) extra memory.

m_new = max(m_old, max(S_tile))
l_new = e^{(m_old - m_new)} · l_old + ∑ e^{(S_tile - m_new)}
O_new = (l_old/l_new) · e^{(m_old - m_new)} · O_old + (1/l_new) · e^{(S_tile - m_new)} · V_tile

FA-3 (2024): This is where it got interesting. Hopper's new hardware meant the kernel could overlap computation with data loading. While one warp group computes WGMMA on the current tile, another warp group loads the next tile via TMA. This warp specialization pattern — producer warps and consumer warps — keeps both the memory subsystem and the Tensor Cores busy simultaneously.

FA-4 (2026): Written in the CuTe DSL (a Python-friendly layer atop CuTe), targeting Blackwell's expanded WGMMA and NVFP4 support. NVIDIA's CUDA Tile/TileIR compiler infrastructure now handles much of the lowering that FA-3 did manually.

What is the core algorithmic trick in FlashAttention?

Using sparse attention to skip most key-value pairs Approximating softmax with a cheaper function Tiled online softmax — computing attention block by block without materializing the full N×N matrix

Chapter 7: TMA — Hardware-Assisted Data Movement

Before Hopper, loading data from HBM to shared memory was the programmer's job. You'd have each thread load a few elements, add a __syncthreads() barrier, then proceed. This is slow: the threads that do the loading could be doing useful math instead.

Hopper introduced the Tensor Memory Accelerator (TMA) — a dedicated hardware unit that copies multi-dimensional tiles between HBM and shared memory without involving any threads. You set up a TMA descriptor (shape, stride, data type, base address) on the CPU, pass it to the kernel, and then a single thread issues one instruction: "TMA, go fetch this tile." The TMA unit does the rest — including address calculation, bounds checking, and format conversion — while all your threads are free to compute.

Before TMA (Ampere)

128 threads each load 1 float → 128 loads → barrier → compute. Threads wasted on loads.

↓ vs.

With TMA (Hopper)

1 thread issues TMA copy → 128 threads compute on previous tile → async barrier when TMA done.

TMA descriptors support up to 5D tensors. You specify the global tensor shape and the box (tile) shape you want to copy. TMA handles corner cases — if the tile extends beyond the tensor boundary, it fills those elements with zeros or clamps them.

c++
// CPU: create TMA descriptor
CUtensorMap tma_desc;
cuTensorMapEncodeTiled(&tma_desc,
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,   // dtype
    2,                                   // 2D tensor
    ptr_global,                          // base pointer
    {M, K},                              // global shape
    {K * 2, 2},                          // global strides (bytes)
    {128, 64},                            // tile (box) shape
    ...);

// GPU: one thread issues async copy
if (threadIdx.x == 0) {
    cp_async_bulk_tensor_2d_global_to_shared(
        &smem_buf, &tma_desc, tile_row, tile_col, barrier);
}
// All other threads: compute on the PREVIOUS tile
// Wait for TMA to finish
barrier.arrive_and_wait();

Why TMA matters so much: On Hopper, a single SM can sustain ~1.5 TB/s of TMA bandwidth — nearly the full per-SM share of HBM bandwidth. Before TMA, threads had to be manually orchestrated to achieve this, wasting cycles on address math. TMA makes the async pipeline pattern practical: prefetch tile N+1 while computing on tile N. FlashAttention-3's speed comes largely from this overlap.

TMA also enables multicast on Hopper: a single TMA operation can deliver the same tile to multiple SMs' shared memories simultaneously. This is critical for all-reduce patterns in distributed training and for broadcasting shared KV tiles across attention heads.

What does TMA (Tensor Memory Accelerator) do?

Accelerates tensor math operations in registers Copies multi-dimensional tiles between HBM and shared memory using dedicated hardware, freeing threads for compute Manages L2 cache eviction policies for tensors

Chapter 8: The Tile DSL Explosion of 2025

By early 2025, the GPU kernel world had a clear problem. CUDA/CUTLASS gave peak performance but required weeks of expert work. Triton gave fast development but couldn't access Hopper's best features. The gap between them was where a dozen Tile DSLs sprouted.

Every one of these DSLs shares the same core idea: express computation as operations on tiles, and let a compiler lower them to hardware. They differ in what language you write in, what hardware features they expose, and how much control you retain.

DSL	Origin	Language	Key Idea
TileLang	Microsoft/TVM	Python	TVM-based, generates CUDA via TileIR. Used in DeepSeek-V4 production.
ThunderKittens	Stanford HAZYlab	C++ DSL	Lightweight C++ header-only. Tiles + async pipelines as first-class.
Helion	Meta / Triton team	Python	Next-gen Triton backend. Python source → Triton IR → GPU.
Tilus	Independent	Python	Minimal tile abstraction over CuTe. Research-oriented.
DeepGEMM	DeepSeek	CUDA/CuTe	JIT-compiled GEMM library. FP8 focus. Minimal code, max perf.
Gluon / TLX	Community	Python	Higher-level tile language, compiles to Triton or CUDA.
Cypress	Research	Python	Declarative tile DSL with automatic scheduling.
Tawa	Research	Python	Functional tile language with formal verification.
Mirage / MPK	CMU	Spec lang	Superoptimizer: you specify the math, it finds the kernel.

Why so many, all at once? Three forces converged: (1) Hopper exposed Triton's limits, creating demand. (2) NVIDIA open-sourced CuTe, giving everyone a target backend. (3) The LLM scaling race made kernel performance a competitive advantage — DeepSeek, Meta, Microsoft all needed custom kernels faster than CUTLASS let them iterate.

What they share: All tile DSLs express three things:

1. Tile Shape

What block of the output does this program compute? (e.g. 128×128)

↓

2. Data Movement

How do tiles flow: HBM → SMEM → registers? When to prefetch?

↓

3. Compute

What math happens on each tile? (MMA, elementwise, reduce, softmax)

Where they differ: How much of the data movement is implicit vs explicit. ThunderKittens gives you direct control over the async pipeline. TileLang's compiler decides the schedule. Mirage searches the entire space of possible implementations.

DeepGEMM deserves special mention. DeepSeek released it in early 2025 as a minimal, JIT-compiled GEMM library. It's <500 lines of core code. It uses CuTe for layout math, raw WGMMA intrinsics, and TMA — but wraps them in a JIT compiler that auto-tunes tile sizes for each shape. It hit >90% of cuBLAS performance for FP8 GEMMs with a fraction of the code. This proved that you don't need a massive framework — a thin, well-designed wrapper around the hardware primitives can be enough.

What common abstraction do all 2025 Tile DSLs share?

Expressing computation as operations on tiles, with a compiler lowering to hardware A Python frontend that generates CUDA source code A graph-based IR that supports automatic differentiation

Chapter 9: Showcase — Data Flow Through a Matmul

This is the payoff. Below is an interactive visualization of how a tiled matrix multiply moves data through the GPU memory hierarchy. You control the tile size, the pipeline depth, and can step through the execution cycle by cycle.

Watch how the tiles of A and B are loaded from HBM into shared memory, partitioned into register fragments, fed to the Tensor Core, and the accumulated result is written back. The key insight: every kernel framework in this lesson is just a different way of describing this exact data flow.

Tiled Matmul: Memory Hierarchy Data Flow

Adjust tile size and pipeline depth. Click Step to advance one cycle, or Play to animate. The left shows the matrices, the right shows the memory hierarchy.

Tile Size64

Pipeline Depth1

      Cycle:
      0
      Phase:
      Ready
    

What to notice: With pipeline depth 1, the GPU alternates between loading and computing — Tensor Cores are idle half the time. Increase pipeline depth to 2 or 3 and watch how load and compute overlap: the GPU loads tile N+1 while computing on tile N. This is the async pipeline pattern that makes Hopper kernels fast. Every Tile DSL is fundamentally about making this pattern easy to express.

Try setting the tile size to 32 and notice how many more cycles the matmul takes — more tiles means more SMEM loads. Now try 128 and see how fewer, larger tiles reduce overhead but require more SMEM per block. This is the tiling tradeoff every kernel programmer faces.

Chapter 10: The 2026 Landscape — What to Learn When

Blackwell (B200) is shipping. FlashAttention-4 targets CuTe DSL. NVIDIA's CUDA Tile/TileIR is becoming the compiler substrate. DeepSeek-V4 runs TileLang kernels in production. ThunderKittens 2.0 adds Blackwell support with MXFP8 and NVFP4. The landscape is consolidating.

Here's a decision tree for 2026:

You Are	Use This	Why
ML researcher prototyping attention variants	Triton or Helion	Fast iteration, good-enough perf, Python-native
Infra engineer optimizing serving latency	TileLang or ThunderKittens	Hopper/Blackwell features, production-ready, reasonable learning curve
Kernel expert writing FlashAttention-class kernels	CuTe + WGMMA + TMA	Full hardware control, peak performance, nothing hidden
Building a compiler or autotuner	CUDA Tile/TileIR	NVIDIA's official IR, target for all future tooling
Need GEMMs only, FP8, minimal code	DeepGEMM	<500 lines, JIT-tuned, 90%+ cuBLAS perf

The pattern: The kernel ecosystem follows the same arc as every systems layer: (1) raw hardware access (CUDA), (2) high-level compiler (Triton), (3) composable library (CuTe), (4) domain-specific languages (Tile DSLs). Each layer doesn't replace the one below — it builds on it. Expert CUDA isn't going away. Triton isn't going away. The new DSLs are the next layer up.

Kernel DSL Timeline (2021–2026)

Hover over each era to see the key developments. The vertical axis shows abstraction level.

What's converging: NVIDIA's CUDA Tile/TileIR is becoming the common backend. TileLang compiles to TileIR. Helion (Triton's successor) will target TileIR. ThunderKittens is adding TileIR codegen. This means the fragmentation of 2025 is likely temporary — by 2027, most DSLs will share a compiler backend, and competition will be on ergonomics and expressiveness rather than code generation.

What to learn right now:

Step 1

Understand the memory hierarchy and tiling (Chapters 1-2 of this lesson)

↓

Step 2

Write a Triton matmul and fused softmax. Get comfortable with the block programming model.

↓

Step 3

Read FlashAttention-2's code. Understand online softmax and tiled attention.

↓

Step 4

Learn CuTe layouts. Read DeepGEMM's source (~500 lines). Understand WGMMA + TMA.

↓

Step 5

Pick one Tile DSL (TileLang or ThunderKittens) and port a kernel to it.

Closing thought: "The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise." — Edsger Dijkstra. Every tool in this lesson is a new semantic level for GPU programming. The hierarchy doesn't collapse — it grows. Understanding the full stack, from registers to Tile DSLs, is what separates someone who uses kernels from someone who writes them.

For an ML researcher prototyping a new attention variant in 2026, what's the best starting point?

Triton or Helion — fast iteration in Python with good-enough performance Raw CUDA — always start from the lowest level DeepGEMM — it's the simplest code