Ring Attention, Zig-Zag balancing, and ultra-long sequences — splitting the sequence itself across GPUs.
Modern LLMs are pushing to ever-longer context windows: 128K, 256K, even 1M tokens. But activation memory grows with sequence length, and the attention score matrix grows quadratically.
Even with TP+SP and full activation recomputation, we still store activations at layer boundaries, and those scale linearly with sequence length. At 128K tokens, these boundary activations alone can exceed the memory of an entire node.
Context parallelism (CP) addresses this by splitting the input sequence across GPUs for all parts of the model, including the attention computation. Most modules (MLP, LayerNorm) process tokens independently, so splitting is free. The trick is handling attention, where every token needs to see every other token's keys and values.
With context parallelism, we split the input tokens evenly across GPUs along the sequence dimension. If we have 4 GPUs and a 16-token sequence, each GPU gets 4 tokens along with their Q, K, V vectors.
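As a rough illustration of the split just described, here is a minimal NumPy sketch. The helper name `split_sequence` and the hidden size of 8 are assumptions made for the example; a real framework would shard tensors on each device rather than slice one array in a single process.

```python
import numpy as np

def split_sequence(x, cp_size):
    """Split a (seq_len, hidden) array into cp_size equal chunks along the sequence axis."""
    seq_len = x.shape[0]
    assert seq_len % cp_size == 0, "sequence length must divide evenly across the CP group"
    return np.split(x, cp_size, axis=0)

# 16 tokens, 4 GPUs (hidden size 8 is an arbitrary choice for illustration)
seq_len, hidden, cp_size = 16, 8, 4
tokens = np.arange(seq_len * hidden, dtype=np.float32).reshape(seq_len, hidden)

chunks = split_sequence(tokens, cp_size)
chunk_shapes = [c.shape for c in chunks]
print(chunk_shapes)  # [(4, 8), (4, 8), (4, 8), (4, 8)]
```

Each entry of `chunks` stands in for the slice of the sequence that one GPU would own.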
For most operations, this split is trivial:
| Operation | Impact of sequence split |
|---|---|
| MLP / FFN | Each token processed independently — no communication needed |
| LayerNorm | Per-token operation — no communication needed |
| Attention | Each token needs K/V from all other tokens — communication required! |
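The table can be checked numerically with a small sketch. Everything here (the ReLU MLP, the weight shapes, the helper `apply_per_chunk`) is a hypothetical toy setup, not any framework's API. It applies each operation chunk by chunk and compares the result against the unsplit computation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, cp_size = 16, 8, 4
x = rng.standard_normal((seq_len, hidden))
w1 = rng.standard_normal((hidden, 4 * hidden))
w2 = rng.standard_normal((4 * hidden, hidden))

def layer_norm(x, eps=1e-5):
    # Normalizes each token over its own hidden dimension
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x):
    # Toy two-layer ReLU MLP applied to every token independently
    return np.maximum(x @ w1, 0.0) @ w2

def causal_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones((q.shape[0], k.shape[0]), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def apply_per_chunk(fn, x):
    # Run fn on each sequence chunk in isolation, as if no GPU talked to another
    return np.concatenate([fn(c) for c in np.split(x, cp_size, axis=0)])

mlp_matches = np.allclose(apply_per_chunk(mlp, x), mlp(x))
ln_matches = np.allclose(apply_per_chunk(layer_norm, x), layer_norm(x))
attn_matches = np.allclose(
    apply_per_chunk(lambda c: causal_attention(c, c, c), x),
    causal_attention(x, x, x),
)
print(mlp_matches, ln_matches, attn_matches)
```

The MLP and LayerNorm results agree with the unsplit versions, while chunk-local attention does not, because later chunks are missing the keys and values held by earlier ones.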
After computing gradients, an all-reduce synchronizes gradients across the CP group, just like in data parallelism. The critical question is: how do we handle the attention communication efficiently?
The key innovation is Ring Attention: arrange GPUs in a logical ring and pass K/V pairs around the ring while overlapping this communication with local attention computation.
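Here is the algorithm for N GPUs. At each time step, every GPU (1) starts an asynchronous send of the K/V chunk it currently holds to the next GPU in the ring, (2) computes attention between its local queries and that chunk while the transfer is in flight, and (3) receives the next chunk from the previous GPU before the next round begins.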
After N rounds, every GPU has computed attention against all K/V pairs from the entire sequence, even though it only ever stored one chunk in memory at a time.
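To make the rotation concrete, here is a single-process NumPy simulation, offered as a sketch rather than a reference implementation. List indices stand in for GPU ranks, and the rotation is a plain list shuffle, so the overlap of communication and compute is not modeled. Merging partial results with a running max and running denominator (online softmax) is an assumption about how the per-chunk results are combined; the function names are made up for this example.

```python
import numpy as np

def full_attention(q, k, v, causal):
    # Reference: ordinary attention over the whole sequence
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        n = q.shape[0]
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q, k, v, n_gpus, causal=True):
    """Simulated Ring Attention: each 'rank' keeps its queries and sees one K/V chunk per round."""
    d = q.shape[-1]
    pos = np.split(np.arange(q.shape[0]), n_gpus)      # global token positions per rank
    qs = np.split(q, n_gpus)
    held = list(zip(np.split(k, n_gpus), np.split(v, n_gpus), pos))  # K/V chunk held by each rank
    m = [np.full(qc.shape[0], -np.inf) for qc in qs]    # running row-wise max
    l = [np.zeros(qc.shape[0]) for qc in qs]            # running softmax denominator
    acc = [np.zeros((qc.shape[0], v.shape[-1])) for qc in qs]  # running weighted sum of V

    for _ in range(n_gpus):
        for r in range(n_gpus):
            kc, vc, kpos = held[r]
            s = qs[r] @ kc.T / np.sqrt(d)
            if causal:
                s = np.where(pos[r][:, None] >= kpos[None, :], s, -np.inf)
            m_new = np.maximum(m[r], s.max(axis=-1))
            # Rows that have seen no visible key yet keep m_new = -inf; avoid inf - inf
            safe_m = np.where(np.isfinite(m_new), m_new, 0.0)
            scale = np.exp(m[r] - safe_m)               # rescale earlier partial results
            p = np.exp(s - safe_m[:, None])
            l[r] = l[r] * scale + p.sum(axis=-1)
            acc[r] = acc[r] * scale[:, None] + p @ vc
            m[r] = m_new
        # Rotate: rank r receives the chunk previously held by rank r - 1
        held = [held[(r - 1) % n_gpus] for r in range(n_gpus)]

    return np.concatenate([a / denom[:, None] for a, denom in zip(acc, l)])

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
ring_out = ring_attention(q, k, v, n_gpus=4)
ref_out = full_attention(q, k, v, causal=True)
print(np.allclose(ring_out, ref_out))
```

The comparison against `full_attention` is the point of the sketch: no rank ever looks at more than one K/V chunk per round, yet the combined output matches attention over the full sequence.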
There is a serious problem with naive Ring Attention: causal attention masking creates a load imbalance.
In causal (autoregressive) attention, token i can only attend to tokens 1 through i. If we assign tokens sequentially (GPU 0 gets tokens 1–4, GPU 1 gets tokens 5–8, and so on), then GPU 0 has very little work, because its tokens attend to only a few predecessors, while the last GPU has the most.
GPU 0 can complete its attention after the first round, since it already holds every token it needs. GPU 3 must wait N-1 further rounds to receive K/V from all earlier GPUs. Some GPUs therefore finish early and sit idle while others are still computing.
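The imbalance is easy to quantify for the 16-token, 4-GPU example. The sketch below counts non-masked query/key pairs per GPU, using the fact that token i attends to exactly i keys; the helper name `causal_work_per_gpu` is made up for illustration.

```python
def causal_work_per_gpu(assignment):
    """assignment: one list of 1-based token positions per GPU; token i attends to i keys."""
    return [sum(tokens) for tokens in assignment]

n_tokens, n_gpus = 16, 4
per_gpu = n_tokens // n_gpus

# Sequential assignment: the first GPU gets tokens 1-4, the next gets 5-8, and so on
sequential = [list(range(g * per_gpu + 1, (g + 1) * per_gpu + 1)) for g in range(n_gpus)]
sequential_work = causal_work_per_gpu(sequential)
print(sequential_work)  # [10, 26, 42, 58]
```

Under this simple count, the last GPU does almost six times the attention work of the first.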
The fix is elegant: instead of assigning tokens sequentially, interleave early and late tokens on each GPU. This is called Zig-Zag Attention.
With 16 tokens and 4 GPUs, the two assignments look like this:
| GPU | Sequential (unbalanced) | Zig-Zag (balanced) |
|---|---|---|
| GPU 0 | tokens 1–4 | tokens 1, 8, 9, 16 |
| GPU 1 | tokens 5–8 | tokens 2, 7, 10, 15 |
| GPU 2 | tokens 9–12 | tokens 3, 6, 11, 14 |
| GPU 3 | tokens 13–16 | tokens 4, 5, 12, 13 |
Each GPU now has a mix of early and late tokens. In the causal attention matrix, the non-masked elements are distributed evenly across GPUs, balancing both compute and communication.
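The Zig-Zag column of the table can be reproduced with a short sketch that deals tokens to GPUs in a back-and-forth order. The function name `zigzag_assignment` is hypothetical, and dealing single tokens is a simplification chosen to match the table; real implementations may interleave contiguous chunks instead.

```python
def zigzag_assignment(n_tokens, n_gpus):
    """Deal 1-based token positions to GPUs left-to-right, then right-to-left, and repeat."""
    assignment = [[] for _ in range(n_gpus)]
    for idx in range(n_tokens):
        row, col = divmod(idx, n_gpus)
        # Even rows go GPU 0 -> N-1, odd rows go GPU N-1 -> 0
        gpu = col if row % 2 == 0 else n_gpus - 1 - col
        assignment[gpu].append(idx + 1)
    return assignment

zigzag = zigzag_assignment(16, 4)
print(zigzag[0])  # [1, 8, 9, 16]

# Token i attends to i keys under causal masking, so summing positions gives the work per GPU
zigzag_work = [sum(tokens) for tokens in zigzag]
print(zigzag_work)  # [34, 34, 34, 34]
```

Every GPU ends up with the same number of non-masked query/key pairs, which is the balance the table is meant to show.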
There are two ways to implement the K/V exchange: all-gather, which collects all K/V at once (as ZeRO-3 does for parameters), or all-to-all (ring), which passes chunks incrementally. All-gather is simpler but uses more temporary memory; the ring approach is more memory-efficient.
To recap Ring Attention: each GPU starts with its local K/V, computes attention, then passes K/V to the next GPU in the ring. After N steps, every GPU has seen all K/V pairs.
| Technique | What it does | Trade-off |
|---|---|---|
| Context Parallelism | Splits sequence across GPUs for all operations | Attention needs cross-GPU K/V exchange |
| Ring Attention | Rotates K/V around a ring, overlapping with compute | N communication steps for N GPUs |
| Zig-Zag Attention | Interleaves tokens to balance causal masking | Slightly more complex token assignment |