Split model layers across nodes. AFAB, 1F1B, interleaved stages, zero-bubble schedules — taming the pipeline bubble.
In the tensor parallelism chapter, we saw that TP relies on fast intra-node communication — typically NVLink between GPUs on the same machine. But what happens when we try to scale TP beyond one node?
The inter-node network (InfiniBand or Ethernet) is 10–100x slower than NVLink. Because TP puts communication on the critical path (every layer needs an all-reduce before the next layer can start), running TP across nodes degrades performance severely, by about 43% in benchmarks.
Pipeline parallelism solves this elegantly: split the model's layers across multiple GPUs. GPU 0 gets layers 1–4, GPU 1 gets layers 5–8, and so on. Each GPU only stores and computes a portion of the model's depth.
The communication pattern is also friendlier to slow networks: instead of all-reducing within every layer (like TP), PP only passes activation tensors at the boundaries between pipeline stages — a handful of point-to-point sends across the entire model.
The basic idea is simple: assign consecutive layers to consecutive GPUs. With a 32-layer model and 4 GPUs:
| GPU (Stage) | Layers | Role |
|---|---|---|
| GPU 0 | Layers 1–8 | Embedding + first layers |
| GPU 1 | Layers 9–16 | Middle layers |
| GPU 2 | Layers 17–24 | Middle layers |
| GPU 3 | Layers 25–32 | Final layers + LM head |
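The assignment in the table can be sketched in a few lines. This is a minimal illustration, not taken from any particular framework: the function name `partition_layers` and the choice to give any leftover layers to the earliest stages are assumptions made for this example.

```python
def partition_layers(num_layers: int, pp: int) -> list[range]:
    """Assign consecutive 1-indexed layers to `pp` stages as evenly as possible."""
    if pp < 1 or num_layers < pp:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, pp)
    stages = []
    start = 1
    for stage in range(pp):
        # Assumption: when layers do not divide evenly, the first stages take one extra.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for gpu, layers in enumerate(partition_layers(32, 4)):
    print(f"GPU {gpu}: layers {layers[0]}-{layers[-1]}")
```

In practice the first and last stages also carry the embedding and the LM head, so real systems sometimes give them fewer transformer layers to balance the load; the sketch ignores that.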
The forward pass flows sequentially: GPU 0 computes its layers, sends the output activation to GPU 1, which computes its layers, sends to GPU 2, and so on. The backward pass flows in reverse.
Here is the catch: while parameters are split nicely across GPUs, activation memory stays roughly the same on each GPU. Why? Each GPU handles 1/PP of the layers, but it processes PP micro-batches before the first backward pass. Storing activations for PP micro-batches across 1/PP layers is roughly the same as storing activations for 1 batch across all layers.
This observation has a practical consequence: PP does not save activation memory by default. We need clever scheduling to manage this, which is exactly what the upcoming pipeline schedules address.
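The argument can be written as a one-line estimate. The symbols here are assumptions introduced for illustration: $L$ is the total number of layers and $a$ is the activation memory of one layer for one micro-batch.

$$
\underbrace{PP}_{\text{micro-batches in flight}} \times \underbrace{\frac{L}{PP}}_{\text{layers per GPU}} \times a \;=\; L \cdot a
$$

The right-hand side is what a single GPU holding all $L$ layers would store for one micro-batch, so the factor of $PP$ saved on layers is cancelled by the $PP$ micro-batches held at once.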
The fundamental challenge of pipeline parallelism is the pipeline bubble — idle time where GPUs wait for data from other stages.
Imagine the simplest possible PP: GPU 0 computes its forward pass, sends the result to GPU 1, waits. GPU 1 computes, sends to GPU 2, waits. While GPU 3 is computing, GPUs 0, 1, and 2 are all idle. The same happens in reverse for the backward pass.
Let us quantify this. Let tf and tb be the forward and backward times for one micro-batch on one stage; a common simplification is tb ≈ 2 · tf. With PP stages (GPUs), each GPU does (tf + tb) of useful work per micro-batch, but it waits (PP − 1) · (tf + tb) for the other stages. The bubble is therefore (PP − 1) · (tf + tb) out of a total of PP · (tf + tb), a fraction of (PP − 1) / PP.
With 4 stages, 75% of the time is wasted! With 8 stages, 87.5%. This is catastrophic for efficiency — clearly we need smarter scheduling.
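These percentages can be checked with a short calculation. The sketch below is only an illustration of the naive case; the function name and the default values for `tf` and `tb` are assumptions, and the result does not depend on them because (tf + tb) cancels.

```python
def naive_bubble_fraction(pp: int, tf: float = 1.0, tb: float = 2.0) -> float:
    """Fraction of the step a GPU spends idle when there are no micro-batches."""
    useful = tf + tb             # one forward and one backward on this stage
    idle = (pp - 1) * (tf + tb)  # waiting while the other pp - 1 stages work
    return idle / (idle + useful)


for pp in (2, 4, 8):
    print(f"PP={pp}: {naive_bubble_fraction(pp):.1%} idle")
```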
The key idea: micro-batches. If we split the batch into M smaller micro-batches, we can keep multiple stages busy simultaneously. While GPU 1 processes micro-batch 1, GPU 0 can already start on micro-batch 2.
The simplest micro-batch schedule is All Forward, All Backward (AFAB): run all forward passes for all micro-batches first, then run all backward passes.
With M micro-batches, the bubble shrinks to (PP − 1) / (M + PP − 1) of the total step time.
With 4 stages and 8 micro-batches, the bubble drops from 75% to about 27%. More micro-batches means a smaller bubble.
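To see how quickly the bubble shrinks, the fraction (PP − 1) / (M + PP − 1) can be evaluated for a few values of M. This is a small sketch with an assumed function name; setting M = 1 recovers the naive case without micro-batches.

```python
def bubble_fraction(pp: int, m: int) -> float:
    """Idle fraction of a step with `pp` stages and `m` micro-batches."""
    return (pp - 1) / (m + pp - 1)


for m in (1, 4, 8, 16, 32):
    print(f"PP=4, M={m:>2}: bubble = {bubble_fraction(4, m):.1%}")
```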
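AFAB is the simplest PP implementation: just two loops, one forward and one backward. But every stage must keep the activations of all M micro-batches until the backward passes begin, so activation memory grows with M. This memory explosion limits how many micro-batches we can practically use.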
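The two-loop structure, and why memory grows with the number of micro-batches, can be sketched as a toy simulation of a single stage. Nothing here runs a real model: `run_afab` is a hypothetical helper, micro-batch indices stand in for stored activation tensors, and the choice to run backward passes in the same order as the forward passes is an assumption.

```python
def run_afab(num_microbatches: int) -> tuple[list[str], int]:
    """Simulate one stage under AFAB; return (operation order, peak stored activations)."""
    ops = []
    stored = []  # micro-batches whose activations are still held for backward
    peak = 0

    # Loop 1: every forward pass runs first and keeps its activations.
    for mb in range(num_microbatches):
        ops.append(f"F{mb}")
        stored.append(mb)
        peak = max(peak, len(stored))

    # Loop 2: backward passes run afterwards, each freeing one micro-batch.
    for mb in range(num_microbatches):
        ops.append(f"B{mb}")
        stored.remove(mb)

    return ops, peak


ops, peak = run_afab(4)
print(ops)
print("peak stored micro-batches:", peak)
```

The peak always equals the number of micro-batches, because nothing is freed until the first loop has finished.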
The One Forward, One Backward (1F1B) schedule addresses AFAB's memory problem by starting backward passes as soon as possible.
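The schedule has three phases: a warm-up in which each GPU runs only forward passes until the pipeline is full, a steady state in which each GPU alternates one forward pass with one backward pass, and a cool-down in which the remaining backward passes drain the pipeline.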
The bubble size is the same as AFAB. But the critical difference is memory: 1F1B only needs to store activations for PP micro-batches at a time, not all M. Since we free activations as soon as the backward pass completes for each micro-batch, we bound the memory at PP instead of M.
| Schedule | Bubble | Peak activation memory |
|---|---|---|
| AFAB | (PP − 1) / (M + PP − 1) | Proportional to M (all micro-batches) |
| 1F1B | (PP − 1) / (M + PP − 1) | Proportional to PP (pipeline depth) |
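The memory bound in the table can be checked with a toy model of the per-stage operation order. This is a sketch under assumptions, not a framework implementation: the function names are hypothetical, and the warm-up length of PP − rank − 1 forward passes follows the convention commonly used for 1F1B.

```python
def one_f_one_b_ops(rank: int, pp: int, m: int) -> list[tuple[str, int]]:
    """Operation order for one stage under 1F1B as (kind, micro-batch) pairs."""
    warmup = min(pp - rank - 1, m)  # assumption: later stages need fewer warm-up forwards
    ops = [("F", mb) for mb in range(warmup)]
    fwd, bwd = warmup, 0
    for _ in range(m - warmup):     # steady state: one forward, then one backward
        ops.append(("F", fwd))
        fwd += 1
        ops.append(("B", bwd))
        bwd += 1
    ops += [("B", mb) for mb in range(bwd, m)]  # cool-down: drain remaining backwards
    return ops


def peak_in_flight(ops: list[tuple[str, int]]) -> int:
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


PP, M = 4, 8
for rank in range(PP):
    print(f"stage {rank}: peak in flight = {peak_in_flight(one_f_one_b_ops(rank, PP, M))}")
```

With these settings the first stage holds at most PP micro-batches and the last stage holds only one, regardless of how large M becomes.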
The trade-off is implementation complexity. Forward and backward passes are no longer cleanly sequential — they are interleaved across devices. Each device independently decides when to switch between forward and backward, requiring careful coordination.
In benchmarks, 1F1B with many micro-batches shows much better scaling behavior. Interestingly, scaling from 1 node (PP=1) to 2 nodes (PP=2) only drops performance by about 14% — much better than TP's ~43% cross-node penalty. This makes PP especially attractive for multi-node training.
1F1B reduced memory, but the bubble is still there. Can we shrink the bubble itself?
The idea: instead of assigning consecutive layers to each GPU, assign interleaved layers. With 8 layers and 2 GPUs:
| GPU | Naive (contiguous) | Interleaved (v=2 chunks) |
|---|---|---|
| GPU 0 | Layers 1–4 | Layers 1, 2, 5, 6 |
| GPU 1 | Layers 5–8 | Layers 3, 4, 7, 8 |
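Each GPU now holds v "model chunks" (v = number of stages per GPU). A micro-batch loops through the pipeline v times, visiting each GPU v times instead of once. Because each visit covers only 1/v of a GPU's layers, the pipeline fills and drains v times faster, and the bubble shrinks to (PP − 1) / (v · M + PP − 1).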
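The interleaved assignment amounts to cutting the model into PP · v chunks and dealing them to the GPUs round-robin. A minimal sketch, with an assumed function name and the simplifying assumption that the layer count divides evenly into chunks:

```python
def interleaved_layers(num_layers: int, pp: int, v: int) -> list[list[int]]:
    """Return the 1-indexed layers held by each GPU with `v` model chunks per GPU."""
    chunks = pp * v
    if num_layers % chunks != 0:
        raise ValueError("num_layers must be divisible by pp * v in this sketch")
    per_chunk = num_layers // chunks
    assignment = [[] for _ in range(pp)]
    for chunk in range(chunks):
        gpu = chunk % pp  # chunks are dealt to GPUs in round-robin order
        first = chunk * per_chunk + 1
        assignment[gpu].extend(range(first, first + per_chunk))
    return assignment


print(interleaved_layers(8, pp=2, v=1))  # contiguous layout
print(interleaved_layers(8, pp=2, v=2))  # interleaved layout
```

Setting v = 1 reproduces the contiguous layout, which makes the two columns of the table easy to compare.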
But there is a cost: communication increases by a factor of v as well. Each micro-batch traverses the pipeline v times, so there are v times more activation transfers between GPUs. This is a direct trade-off between bubble size and communication volume.
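The trade-off can be put into numbers. The sketch below assumes the interleaved bubble fraction (PP − 1) / (v · M + PP − 1) and models communication simply as v times the non-interleaved volume, as described above; the function name is hypothetical.

```python
def interleaved_tradeoff(pp: int, m: int, v: int) -> tuple[float, int]:
    """Return (bubble fraction, communication volume relative to v = 1)."""
    bubble = (pp - 1) / (v * m + pp - 1)
    comm_factor = v  # each micro-batch crosses the pipeline v times
    return bubble, comm_factor


for v in (1, 2, 4):
    bubble, comm = interleaved_tradeoff(pp=4, m=8, v=v)
    print(f"v={v}: bubble = {bubble:.1%}, communication = {comm}x")
```

Whether a larger v pays off depends on how much of the extra communication can be overlapped with compute, which this simple model does not capture.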
Scheduling becomes more complex too. At each time step, a GPU must decide: should it process an earlier micro-batch through later layers (depth-first: finish batches quickly) or a later micro-batch through earlier layers (breadth-first: fill the pipeline faster)? This choice matters for memory and throughput.
Can we eliminate the bubble entirely? Recent work by Sea AI Lab and DeepSeek suggests we can get close.
The key observation: the backward pass through a matrix multiplication is actually two separate operations:
| Operation | Symbol | Purpose | When needed? |
|---|---|---|---|
| Input backward | B | Compute gradient for inputs (needed by earlier layers) | Immediately — on the critical path |
| Weight backward | W | Compute gradient for weights (needed by optimizer) | Anytime before the optimizer step |
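For a single linear layer the split is easy to write down. Assume, for illustration, a layer $Y = XW$ with input $X$, weight $W$, and upstream gradient $\nabla_Y$ arriving from the next layer (this notation is introduced here, not taken from the schedules themselves):

$$
\text{B:}\quad \nabla_X = \nabla_Y \, W^\top
\qquad\qquad
\text{W:}\quad \nabla_W = X^\top \, \nabla_Y
$$

Only $\nabla_X$ is passed back to the previous layer or stage. $\nabla_W$ is not consumed by any other backward computation, so it can be deferred as long as $X$ and $\nabla_Y$ are kept until it runs.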
The ZB-H2 schedule exploits this by scheduling W operations into the idle slots of the pipeline, achieving a theoretically zero-bubble schedule. Finding the optimal placement requires solving an Integer Linear Programming (ILP) problem to minimize idle time.
DeepSeek's DualPipe extends this further: two streams of micro-batches propagate from both ends of the pipeline simultaneously, and their forward/backward operations are interleaved to maximize GPU utilization. DeepSeek reported "near-zero all-to-all communication overhead" for their V3/R1 models using this approach.
| Schedule | Bubble | Memory | Complexity |
|---|---|---|---|
| Naive (no micro-batches) | (PP−1)/PP — catastrophic | Low | Trivial |
| AFAB | (PP−1)/(M+PP−1) | O(M) — high | Simple |
| 1F1B | (PP−1)/(M+PP−1) | O(PP) — bounded | Moderate |
| Interleaved | (PP−1)/(v·M+PP−1) | O(PP) | Complex |
| Zero Bubble | ≈ 0 | O(PP) | Very complex |