From CUDA threads to Tile DSLs — why there are so many kernel frameworks and which to use when.
You call torch.matmul(A, B) and it takes 4 milliseconds. An expert writes a custom CUDA kernel for the same operation and it takes 0.3 milliseconds. That's a 13x difference — same GPU, same data, same math.
Where does the gap come from? PyTorch launches a generic kernel that works for any shape, any dtype, any device. It can't assume your matrices are square. It can't fuse the ReLU you're about to apply next. It doesn't know you'll call this 10,000 times with the same shape. The generic kernel pays for flexibility with performance.
A custom kernel knows exactly what it's doing. It tiles the matrices to fit in fast on-chip memory. It fuses multiple operations into one pass. It uses Tensor Cores — special hardware units that do 16x16 matrix multiplies in a single clock cycle. It overlaps memory loads with computation so the GPU never stalls waiting for data.
This gap spawned an entire ecosystem of tools. CUDA (2007) gave you raw control but required expert-level C++. Triton (2021) let you write Python and get good kernels. CUTLASS gave template libraries for structured patterns. And in 2025, a dozen "Tile DSLs" appeared, each trying to find the sweet spot between productivity and performance.
By the end of this lesson, you'll understand why each exists, what problem it solves, and which to reach for in 2026.
Click each implementation to see how close it gets to hardware peak. The gap is what custom kernels close.
Every GPU performance problem is a memory problem. The arithmetic units on an H100 can multiply matrices at 990 trillion operations per second. But the memory that feeds them can only deliver 3.35 TB/s. If you do the math: to keep the ALUs busy, every byte loaded from memory must fuel ~296 FP16 operations. If your operation does fewer ops per byte (low arithmetic intensity), the GPU starves.
The solution is a memory hierarchy — multiple levels of storage, each faster and smaller than the last. Data starts in slow, massive HBM and gets staged through progressively faster caches until it reaches the registers where math actually happens.
| Level | Size (H100) | Bandwidth | Latency |
|---|---|---|---|
| Registers | 256 KB per SM | ~20 TB/s | ~1 cycle |
| Shared Memory | 228 KB per SM | ~15 TB/s | ~20 cycles |
| L2 Cache | 50 MB total | ~6 TB/s | ~200 cycles |
| HBM (Global) | 80 GB total | 3.35 TB/s | ~400 cycles |
Registers are the fastest. Each thread has its own private registers — accessing them costs just one clock cycle. But you only get a few hundred per thread, and they're invisible to other threads.
Shared memory (SMEM) is the key to fast kernels. It sits on-chip, shared among all threads in a thread block. It's 6x faster than HBM but only 228 KB per streaming multiprocessor. The entire art of GPU kernel writing is about loading a tile of data from HBM into shared memory, doing as much computation as possible on that tile, then writing the result back.
L2 cache is shared across the entire GPU. It helps when multiple thread blocks access nearby data, but you don't control what stays in L2 — the hardware does.
HBM (High Bandwidth Memory) is the "main memory" of the GPU. 80 GB is a lot, but at 400+ cycle latency, every trip to HBM is expensive. The golden rule: minimize HBM round-trips.
Click a level to see its size, speed, and role. Watch data flow from HBM up to registers.
CUDA is NVIDIA's programming model for GPUs. It's been around since 2007 and everything else in this lesson is built on top of it. To understand any kernel framework, you need to understand CUDA's execution model.
When you launch a CUDA kernel, you specify a grid of thread blocks. Each thread block runs on one Streaming Multiprocessor (SM). An H100 has 132 SMs. Each block contains up to 1024 threads, and those threads execute in groups of 32 called warps.
Here's the simplest possible kernel — vector addition:
cuda __global__ void vecAdd(float* A, float* B, float* C, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) C[i] = A[i] + B[i]; } // Launch: vecAdd<<<(N+255)/256, 256>>>(A, B, C, N); // Grid has ceil(N/256) blocks, each block has 256 threads. // Thread i loads A[i] and B[i] from HBM, adds them, writes C[i].
Each thread computes one element. blockIdx.x tells you which block you're in, threadIdx.x tells you which thread within that block. Together they give a global index.
But this kernel is memory-bound — it does 1 FLOP per 12 bytes loaded (two reads + one write, each 4 bytes). For a matmul, we need tiling: each block loads a tile of A and B into shared memory, then all threads in the block reuse that tile for many multiply-accumulate operations.
The CUDA programming model gives you total control: you decide what each thread does, how shared memory is used, how warps synchronize. This power is why experts can write kernels that hit 80%+ of peak throughput. But it's also why CUDA code for a matrix multiply is 200+ lines of intricate C++ with manual index arithmetic, explicit memory barriers, and architecture-specific tuning.
In 2017, NVIDIA added a new kind of unit to their GPUs: the Tensor Core. Regular CUDA cores do one multiply-add per cycle per thread. A Tensor Core does an entire small matrix multiply — D = A * B + C where A, B, C, D are small matrices — in a single operation across a warp.
| Generation | GPU | WMMA shape (FP16) | FP16 TFLOPS |
|---|---|---|---|
| Volta (2017) | V100 | 16 × 16 × 16 | 125 |
| Ampere (2020) | A100 | 16 × 8 × 16 | 312 |
| Hopper (2022) | H100 | 64 × 256 × 16 (WGMMA) | 990 |
| Blackwell (2024) | B200 | expanded WGMMA | 2250 |
On Volta/Ampere, you used WMMA (Warp Matrix Multiply-Accumulate): each warp cooperatively loads fragments of A and B, calls wmma::mma_sync, and gets back a fragment of D. The programmer manages loading A/B tiles from shared memory into register fragments.
Hopper changed everything with WGMMA (Warp Group MMA). Instead of one warp (32 threads), WGMMA uses a warp group (128 threads = 4 warps). The tile sizes exploded: 64 × 256 × 16 in one instruction. And WGMMA can read B directly from shared memory — you don't need to load it into registers first.
Here's what WMMA looks like in CUDA:
cuda // Declare 16x16x16 fragments wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag; wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag; wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag; // Load tiles from shared memory into register fragments wmma::load_matrix_sync(a_frag, smem_a + offset_a, 16); wmma::load_matrix_sync(b_frag, smem_b + offset_b, 16); wmma::fill_fragment(c_frag, 0.0f); // One warp does 16x16x16 = 8192 FMAs in one instruction wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
That's 8192 multiply-accumulate operations in one instruction. Without Tensor Cores, those same 32 threads would need 256 cycles to do the same work. This is why Tensor Core utilization is the single most important metric for kernel performance on modern GPUs.
Writing a fast CUDA matmul is 200+ lines of index math, shared memory management, and architecture-specific intrinsics. Philippe Tillet looked at this in 2019 and asked: what if a compiler handled the hard parts?
Triton lets you write GPU kernels in Python. You think in blocks (tiles), not individual threads. The compiler decides how to map your blocks to warps, how to stage data through shared memory, and when to use Tensor Cores.
python import triton import triton.language as tl @triton.jit def matmul_kernel(A, B, C, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr): pid_m = tl.program_id(0) pid_n = tl.program_id(1) # Pointers to the tiles of A and B offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) offs_k = tl.arange(0, BLOCK_K) a_ptrs = A + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak b_ptrs = B + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) for k in range(0, K, BLOCK_K): a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k) b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k) acc += tl.dot(a, b) # compiler maps this to Tensor Cores a_ptrs += BLOCK_K * stride_ak b_ptrs += BLOCK_K * stride_bk c_ptrs = C + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn tl.store(c_ptrs, acc.to(tl.float16))
Notice what's missing: no explicit shared memory allocation, no thread indexing, no __syncthreads(), no WMMA intrinsics. You write tl.dot(a, b) and the compiler figures out whether to use Tensor Cores, how to tile the dot product, and how to pipeline the loads.
How does the compiler work? Triton operates on blocked programs. Each program_id is one tile of the output. The compiler:
The limitation: Triton's compiler makes assumptions. When those assumptions match your workload (standard matmul, attention, pointwise ops), it works beautifully. When they don't (persistent kernels, warp specialization, async pipelines), the compiler can't express what you need. This is what hit the wall on Hopper.
While Triton hides the hardware behind a compiler, NVIDIA's CUTLASS (CUDA Templates for Linear Algebra Subroutines) takes the opposite approach: give the programmer composable building blocks that directly express hardware operations.
CUTLASS started in 2017 as C++ templates for GEMM (General Matrix Multiply). It was powerful but verbose — a single GEMM configuration involved a dozen template parameters. In 2023, NVIDIA introduced CuTe (CUDA Tensors), a library inside CUTLASS that changed everything.
In CuTe, a Tensor = data pointer + Layout. A Layout = Shape + Stride. This is the entire abstraction:
c++ using namespace cute; // A 128x64 tile of FP16, row-major (stride = 64 between rows) auto layout_A = make_layout(make_shape(128, 64), make_stride(64, 1)); // Wrap a pointer with the layout to get a CuTe Tensor auto tA = make_tensor(make_gmem_ptr(ptr_A), layout_A); // Partition this tile across threads for WGMMA auto thr_A = local_partition(tA, threadIdx.x, Shape<64,16>{}); // Copy from global memory to shared memory copy(tA, smem_A); // CuTe handles coalescing, bank conflicts, etc.
Why CUTLASS/CuTe for Hopper? Hopper introduced hardware features that Triton v1 couldn't express:
| Feature | What It Does | Why Triton v1 Couldn't |
|---|---|---|
| WGMMA | 4-warp MMA, reads B from SMEM | Triton assumed 1-warp MMA |
| TMA | Hardware DMA engine for tiles | No compiler support in 2023 |
| Warp Specialization | Different warps do different jobs | Triton's model: all warps do same thing |
| Async Pipeline | Overlap compute with next load | Requires manual barrier management |
This is why FlashAttention-3 (2024) was written in CUTLASS/CuTe, not Triton. The attention kernel needed all four features simultaneously to hit peak Hopper performance.
No single kernel tells the story of GPU programming better than FlashAttention. Each version pushed the boundaries of what was possible, and each required different tools. It's our case study through the entire timeline.
The problem: Standard attention computes Q·KT (an N×N matrix), applies softmax, then multiplies by V. For N=8192, that intermediate matrix is 8192×8192 = 256 MB in FP16. It has to be written to HBM and read back — two round-trips through slow memory for a matrix you immediately throw away.
| Version | Year | Written In | Key Techniques | H100 Speed |
|---|---|---|---|---|
| FA-1 | 2022 | Raw CUDA | Tiled softmax, online softmax trick | ~300 TFLOPS |
| FA-2 | 2023 | CUDA + Triton | Parallelized over seq length, reduced SMEM | ~450 TFLOPS |
| FA-3 | 2024 | CUTLASS/CuTe | WGMMA, TMA, warp specialization, FP8, async pipeline | ~740 TFLOPS |
| FA-4 | 2026 | CuTe DSL | Blackwell WGMMA, NVFP4, CUDA Tile/TileIR | ~1500 TFLOPS |
FA-1 (2022): Tri Dao's original paper. Written in raw CUDA with manual SMEM management. The breakthrough was the online softmax trick — you can compute softmax incrementally, tile by tile, without ever seeing the full row. This is the algorithm that made everything else possible.
The online softmax works like this: as you process each K-tile, you track the running maximum m and running sum l. When a new tile has a larger maximum, you rescale the previous accumulator. At the end, you have the exact same result as full-matrix softmax, but computed in O(1) extra memory.
FA-3 (2024): This is where it got interesting. Hopper's new hardware meant the kernel could overlap computation with data loading. While one warp group computes WGMMA on the current tile, another warp group loads the next tile via TMA. This warp specialization pattern — producer warps and consumer warps — keeps both the memory subsystem and the Tensor Cores busy simultaneously.
FA-4 (2026): Written in the CuTe DSL (a Python-friendly layer atop CuTe), targeting Blackwell's expanded WGMMA and NVFP4 support. NVIDIA's CUDA Tile/TileIR compiler infrastructure now handles much of the lowering that FA-3 did manually.
Before Hopper, loading data from HBM to shared memory was the programmer's job. You'd have each thread load a few elements, add a __syncthreads() barrier, then proceed. This is slow: the threads that do the loading could be doing useful math instead.
Hopper introduced the Tensor Memory Accelerator (TMA) — a dedicated hardware unit that copies multi-dimensional tiles between HBM and shared memory without involving any threads. You set up a TMA descriptor (shape, stride, data type, base address) on the CPU, pass it to the kernel, and then a single thread issues one instruction: "TMA, go fetch this tile." The TMA unit does the rest — including address calculation, bounds checking, and format conversion — while all your threads are free to compute.
TMA descriptors support up to 5D tensors. You specify the global tensor shape and the box (tile) shape you want to copy. TMA handles corner cases — if the tile extends beyond the tensor boundary, it fills those elements with zeros or clamps them.
c++ // CPU: create TMA descriptor CUtensorMap tma_desc; cuTensorMapEncodeTiled(&tma_desc, CU_TENSOR_MAP_DATA_TYPE_FLOAT16, // dtype 2, // 2D tensor ptr_global, // base pointer {M, K}, // global shape {K * 2, 2}, // global strides (bytes) {128, 64}, // tile (box) shape ...); // GPU: one thread issues async copy if (threadIdx.x == 0) { cp_async_bulk_tensor_2d_global_to_shared( &smem_buf, &tma_desc, tile_row, tile_col, barrier); } // All other threads: compute on the PREVIOUS tile // Wait for TMA to finish barrier.arrive_and_wait();
TMA also enables multicast on Hopper: a single TMA operation can deliver the same tile to multiple SMs' shared memories simultaneously. This is critical for all-reduce patterns in distributed training and for broadcasting shared KV tiles across attention heads.
By early 2025, the GPU kernel world had a clear problem. CUDA/CUTLASS gave peak performance but required weeks of expert work. Triton gave fast development but couldn't access Hopper's best features. The gap between them was where a dozen Tile DSLs sprouted.
Every one of these DSLs shares the same core idea: express computation as operations on tiles, and let a compiler lower them to hardware. They differ in what language you write in, what hardware features they expose, and how much control you retain.
| DSL | Origin | Language | Key Idea |
|---|---|---|---|
| TileLang | Microsoft/TVM | Python | TVM-based, generates CUDA via TileIR. Used in DeepSeek-V4 production. |
| ThunderKittens | Stanford HAZYlab | C++ DSL | Lightweight C++ header-only. Tiles + async pipelines as first-class. |
| Helion | Meta / Triton team | Python | Next-gen Triton backend. Python source → Triton IR → GPU. |
| Tilus | Independent | Python | Minimal tile abstraction over CuTe. Research-oriented. |
| DeepGEMM | DeepSeek | CUDA/CuTe | JIT-compiled GEMM library. FP8 focus. Minimal code, max perf. |
| Gluon / TLX | Community | Python | Higher-level tile language, compiles to Triton or CUDA. |
| Cypress | Research | Python | Declarative tile DSL with automatic scheduling. |
| Tawa | Research | Python | Functional tile language with formal verification. |
| Mirage / MPK | CMU | Spec lang | Superoptimizer: you specify the math, it finds the kernel. |
What they share: All tile DSLs express three things:
Where they differ: How much of the data movement is implicit vs explicit. ThunderKittens gives you direct control over the async pipeline. TileLang's compiler decides the schedule. Mirage searches the entire space of possible implementations.
DeepGEMM deserves special mention. DeepSeek released it in early 2025 as a minimal, JIT-compiled GEMM library. It's <500 lines of core code. It uses CuTe for layout math, raw WGMMA intrinsics, and TMA — but wraps them in a JIT compiler that auto-tunes tile sizes for each shape. It hit >90% of cuBLAS performance for FP8 GEMMs with a fraction of the code. This proved that you don't need a massive framework — a thin, well-designed wrapper around the hardware primitives can be enough.
This is the payoff. Below is an interactive visualization of how a tiled matrix multiply moves data through the GPU memory hierarchy. You control the tile size, the pipeline depth, and can step through the execution cycle by cycle.
Watch how the tiles of A and B are loaded from HBM into shared memory, partitioned into register fragments, fed to the Tensor Core, and the accumulated result is written back. The key insight: every kernel framework in this lesson is just a different way of describing this exact data flow.
Adjust tile size and pipeline depth. Click Step to advance one cycle, or Play to animate. The left shows the matrices, the right shows the memory hierarchy.
Try setting the tile size to 32 and notice how many more cycles the matmul takes — more tiles means more SMEM loads. Now try 128 and see how fewer, larger tiles reduce overhead but require more SMEM per block. This is the tiling tradeoff every kernel programmer faces.
Blackwell (B200) is shipping. FlashAttention-4 targets CuTe DSL. NVIDIA's CUDA Tile/TileIR is becoming the compiler substrate. DeepSeek-V4 runs TileLang kernels in production. ThunderKittens 2.0 adds Blackwell support with MXFP8 and NVFP4. The landscape is consolidating.
Here's a decision tree for 2026:
| You Are | Use This | Why |
|---|---|---|
| ML researcher prototyping attention variants | Triton or Helion | Fast iteration, good-enough perf, Python-native |
| Infra engineer optimizing serving latency | TileLang or ThunderKittens | Hopper/Blackwell features, production-ready, reasonable learning curve |
| Kernel expert writing FlashAttention-class kernels | CuTe + WGMMA + TMA | Full hardware control, peak performance, nothing hidden |
| Building a compiler or autotuner | CUDA Tile/TileIR | NVIDIA's official IR, target for all future tooling |
| Need GEMMs only, FP8, minimal code | DeepGEMM | <500 lines, JIT-tuned, 90%+ cuBLAS perf |
Hover over each era to see the key developments. The vertical axis shows abstraction level.
What's converging: NVIDIA's CUDA Tile/TileIR is becoming the common backend. TileLang compiles to TileIR. Helion (Triton's successor) will target TileIR. ThunderKittens is adding TileIR codegen. This means the fragmentation of 2025 is likely temporary — by 2027, most DSLs will share a compiler backend, and competition will be on ergonomics and expressiveness rather than code generation.
What to learn right now: