Ch 6: Pipeline Parallelism — Ultra-Scale Playbook

Chapter 0: Why Pipeline?

In the tensor parallelism chapter, we saw that TP relies on fast intra-node communication — typically NVLink between GPUs on the same machine. But what happens when we try to scale TP beyond one node?

The inter-node network (InfiniBand or Ethernet) is 10–100x slower than NVLink. Since TP puts communication on the critical path (every layer needs an all-reduce before the next layer can start), going cross-node with TP causes severe performance degradation: a drop of around 43% in cross-node benchmarks.

The fundamental problem: Sequence and context parallelism help with long sequences, but they do not help if the bottleneck is the model size itself. For 70B+ parameter models, even the weights alone can exceed the memory of 4–8 GPUs on a single node.

Pipeline parallelism solves this elegantly: split the model's layers across multiple GPUs. GPU 0 gets layers 1–4, GPU 1 gets layers 5–8, and so on. Each GPU only stores and computes a portion of the model's depth.

The communication pattern is also friendlier to slow networks: instead of all-reducing within every layer (like TP), PP only passes activation tensors at the boundaries between pipeline stages — a handful of point-to-point sends across the entire model.

Key insight: PP sends activations between stages, while ZeRO-3 sends weights. Both partition the model across GPUs, but their communication patterns and trade-offs are fundamentally different.

Check: Why is TP difficult to scale beyond one node?

TP puts communication on the critical path, and inter-node bandwidth is much lower than intra-node
TP requires identical GPUs
TP does not support gradient accumulation

Chapter 1: Splitting Layers

The basic idea is simple: assign consecutive layers to consecutive GPUs. With a 32-layer model and 4 GPUs:

GPU (Stage)	Layers	Role
GPU 0	Layers 1–8	Embedding + first layers
GPU 1	Layers 9–16	Middle layers
GPU 2	Layers 17–24	Middle layers
GPU 3	Layers 25–32	Final layers + LM head

The forward pass flows sequentially: GPU 0 computes its layers, sends the output activation to GPU 1, which computes its layers, sends to GPU 2, and so on. The backward pass flows in reverse.
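The table above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the layer count divides evenly across stages; the helper name `partition_layers` is hypothetical and does not come from any particular library.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Assign consecutive 1-based layer indices to consecutive stages."""
    if num_layers % num_stages != 0:
        # Assumption of this sketch: no uneven splits.
        raise ValueError("this sketch assumes layers divide evenly across stages")
    per_stage = num_layers // num_stages
    return [
        range(s * per_stage + 1, (s + 1) * per_stage + 1)
        for s in range(num_stages)
    ]


stages = partition_layers(num_layers=32, num_stages=4)
for gpu, layers in enumerate(stages):
    print(f"GPU {gpu}: layers {layers[0]}-{layers[-1]}")
```

Real implementations often give the first and last stages fewer transformer layers to compensate for the embedding and LM head, which this sketch ignores.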

Memory for parameters: Model weights are cleanly split. If the full model needs 80 GB for parameters, each of the 4 GPUs holds only ~20 GB. However, activation memory tells a different story.

Here is the catch: while parameters are split nicely across GPUs, activation memory stays roughly the same on each GPU. Why? Each GPU handles 1/PP of the layers, but it processes PP micro-batches before the first backward pass. Storing activations for PP micro-batches across 1/PP layers is roughly the same as storing activations for 1 batch across all layers.
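The arithmetic behind this argument can be checked directly. The sketch below uses a made-up unit, one "activation unit" being the activations of one micro-batch for one layer, and assumes the worst case described above, where a GPU holds PP micro-batches in flight before its first backward pass. The function name and the 80 GB / 32-layer figures are illustrative only.

```python
def per_gpu_memory(param_gb: float, num_layers: int, pp: int) -> tuple[float, int]:
    """Return (parameter GB per GPU, activation units held per GPU).

    One activation unit = activations of one micro-batch for one layer
    (a hypothetical unit used only for this comparison).
    """
    params_per_gpu = param_gb / pp
    layers_per_gpu = num_layers // pp
    in_flight_micro_batches = pp  # forwards completed before the first backward
    return params_per_gpu, layers_per_gpu * in_flight_micro_batches


for pp in (1, 2, 4, 8):
    params, acts = per_gpu_memory(param_gb=80, num_layers=32, pp=pp)
    print(f"PP={pp}: {params:.0f} GB params, {acts} activation units")
```

Parameter memory falls as 1/PP while the activation count stays at 32 units for every PP, which is the point of the paragraph above.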

This observation has a practical consequence: PP does not save activation memory by default. We need clever scheduling to manage this, which is exactly what the upcoming pipeline schedules address.

Check: Does pipeline parallelism reduce activation memory per GPU?

Yes, each GPU processes fewer layers so stores fewer activations
Not by default: each GPU stores activations for multiple micro-batches, offsetting the layer reduction
Activation memory is always negligible

Chapter 2: The Bubble Problem

The fundamental challenge of pipeline parallelism is the pipeline bubble — idle time where GPUs wait for data from other stages.

Imagine the simplest possible PP: GPU 0 computes its forward pass, sends the result to GPU 1, waits. GPU 1 computes, sends to GPU 2, waits. While GPU 3 is computing, GPUs 0, 1, and 2 are all idle. The same happens in reverse for the backward pass.

Bubble = wasted money. If you are paying for 4 GPUs but only 1 is computing at any given time, you are wasting 75% of your budget. The pipeline bubble is the central challenge all PP schedules try to minimize.

Let us quantify this. Let t_f and t_b be the forward and backward times for one micro-batch on one stage; a common simplification is t_b ≈ 2 · t_f. With PP stages (GPUs), each GPU does t_f + t_b of useful work per step, but it also sits idle for (PP − 1) · (t_f + t_b) while the other stages compute. One step therefore takes PP · (t_f + t_b), and the idle share is:

bubble_fraction = (PP − 1) / PP

With 4 stages, 75% of the time is wasted! With 8 stages, 87.5%. This is catastrophic for efficiency — clearly we need smarter scheduling.
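To make the numbers concrete, here is a small sketch that plugs values into the expressions above. It assumes t_b = 2 · t_f with t_f set to one arbitrary time unit; the function name is hypothetical.

```python
def naive_pipeline_times(pp: int, t_f: float, t_b: float) -> tuple[float, float, float]:
    """Return (busy time per GPU, idle time per GPU, bubble fraction)."""
    busy = t_f + t_b
    idle = (pp - 1) * (t_f + t_b)
    return busy, idle, idle / (busy + idle)


t_f = 1.0          # arbitrary time unit (assumption)
t_b = 2.0 * t_f    # the common simplification from the text
for pp in (2, 4, 8):
    busy, idle, frac = naive_pipeline_times(pp, t_f, t_b)
    print(f"PP={pp}: busy {busy:.0f}, idle {idle:.0f}, bubble {frac:.1%}")
```

The ratio t_b / t_f cancels out of the fraction, so the result depends only on PP.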

The key idea: micro-batches. If we split the batch into M smaller micro-batches, we can keep multiple stages busy simultaneously. While GPU 1 processes micro-batch 1, GPU 0 can already start on micro-batch 2.

Check: With 4 pipeline stages and no micro-batching, what fraction of total time is the bubble?

75%: only 1 of 4 GPUs is active at any time
50%
25%

Chapter 3: AFAB Schedule

The simplest micro-batch schedule is All Forward, All Backward (AFAB): run all forward passes for all micro-batches first, then run all backward passes.

1. All Forward: process micro-batches 1, 2, 3, ..., M through all stages.

2. All Backward: backpropagate through micro-batches M, M−1, ..., 1 in reverse.

3. Optimizer step: all GPUs update their local parameters.

With M micro-batches, the bubble shrinks:

bubble_fraction = (PP − 1) / (M + PP − 1)

With 4 stages and 8 micro-batches, the bubble drops from 75% to about 27%. More micro-batches means a smaller bubble.
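The formula is easy to tabulate. The sketch below uses Python's `fractions.Fraction` to keep exact values; `bubble_fraction` is a hypothetical helper name.

```python
from fractions import Fraction


def bubble_fraction(pp: int, m: int) -> Fraction:
    """Bubble fraction for PP stages and M micro-batches (AFAB formula above)."""
    return Fraction(pp - 1, m + pp - 1)


for m in (1, 4, 8, 32):
    frac = bubble_fraction(pp=4, m=m)
    print(f"M={m:>2}: {frac} = {float(frac):.1%}")
```

Setting M = 1 recovers the 75% of the naive schedule, and M = 8 gives 3/11, the roughly 27% quoted above.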

The memory problem: AFAB stores activations for all M micro-batches in memory until the backward pass begins. With many micro-batches, this causes an activation memory explosion. We need to keep M large (to shrink the bubble) but we cannot afford the memory cost.

AFAB is the simplest PP implementation — just two loops (one forward, one backward). Its code is straightforward. But the memory explosion limits how many micro-batches we can practically use.
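The "two loops" structure can be sketched for a single stage as follows. This is a toy model, not a training loop: it only records the order of operations and counts how many micro-batches have activations held in memory, following the reverse backward order described above. The function name is hypothetical.

```python
def run_afab(num_micro_batches: int) -> tuple[list[str], int]:
    """Return (operation log, peak number of stored micro-batch activations)."""
    stored: list[int] = []  # micro-batches whose activations await backward
    log: list[str] = []
    peak = 0

    # Loop 1: all forward passes; activations pile up.
    for mb in range(1, num_micro_batches + 1):
        stored.append(mb)
        log.append(f"F{mb}")
        peak = max(peak, len(stored))

    # Loop 2: all backward passes; each one frees its activations.
    for mb in range(num_micro_batches, 0, -1):
        stored.remove(mb)
        log.append(f"B{mb}")

    return log, peak


log, peak = run_afab(num_micro_batches=4)
print(" ".join(log))
print("peak stored micro-batches:", peak)
```

The peak always equals M, which is the memory problem in miniature.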

Check: What is the main disadvantage of the AFAB schedule?

It cannot use micro-batches
It requires all-reduce at every layer
It stores activations for all micro-batches until backward, causing memory explosion

Chapter 4: 1F1B Schedule

The One Forward, One Backward (1F1B) schedule addresses AFAB's memory problem by starting backward passes as soon as possible.

The schedule has three phases:

1. Warmup: fill the pipeline. Each stage runs forward passes until the first micro-batch has reached the last stage.

2. Steady state (1F1B): alternate one forward pass and one backward pass.

3. Cooldown: drain the remaining backward passes.

The bubble size is the same as AFAB. But the critical difference is memory: 1F1B only needs to store activations for PP micro-batches at a time, not all M. Since we free activations as soon as the backward pass completes for each micro-batch, we bound the memory at PP instead of M.

Schedule	Bubble	Peak activation memory
AFAB	(PP − 1) / (M + PP − 1)	Proportional to M (all micro-batches)
1F1B	(PP − 1) / (M + PP − 1)	Proportional to PP (pipeline depth)

Key insight: 1F1B lets us safely increase M (more micro-batches = smaller bubble) without blowing up memory, because we no longer hold all activations simultaneously.
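The per-stage operation order can be sketched to check the memory bound. The warmup count used here, min(PP − stage − 1, M) with 0-based stage indices, is a common formulation of 1F1B and is an assumption of this sketch rather than something stated above; the function names are hypothetical.

```python
def one_f_one_b_ops(stage: int, pp: int, m: int) -> list[str]:
    """Operation order for one stage (0-based) under a common 1F1B formulation."""
    warmup = min(pp - stage - 1, m)  # assumed warmup length
    steady = m - warmup
    ops: list[str] = []
    next_f = next_b = 1
    for _ in range(warmup):          # phase 1: forwards only
        ops.append(f"F{next_f}")
        next_f += 1
    for _ in range(steady):          # phase 2: one forward, one backward
        ops.append(f"F{next_f}")
        next_f += 1
        ops.append(f"B{next_b}")
        next_b += 1
    for _ in range(warmup):          # phase 3: drain backwards
        ops.append(f"B{next_b}")
        next_b += 1
    return ops


def peak_in_flight(ops: list[str]) -> int:
    """Largest number of micro-batches with activations stored at once."""
    stored = peak = 0
    for op in ops:
        stored += 1 if op.startswith("F") else -1
        peak = max(peak, stored)
    return peak


for stage in range(4):
    ops = one_f_one_b_ops(stage, pp=4, m=8)
    print(f"stage {stage}: peak {peak_in_flight(ops)} | {' '.join(ops)}")
```

Under this formulation the first stage peaks at PP stored micro-batches and later stages hold fewer, regardless of how large M is.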

The trade-off is implementation complexity. Forward and backward passes are no longer cleanly sequential — they are interleaved across devices. Each device independently decides when to switch between forward and backward, requiring careful coordination.

In benchmarks, 1F1B with many micro-batches shows much better scaling behavior. Interestingly, scaling from 1 node (PP=1) to 2 nodes (PP=2) only drops performance by about 14% — much better than TP's ~43% cross-node penalty. This makes PP especially attractive for multi-node training.

Check: How does 1F1B improve on AFAB?

It reduces peak activation memory from O(M) to O(PP) by interleaving forward and backward passes
It eliminates the bubble entirely
It uses fewer GPUs

Chapter 5: Interleaved Stages

1F1B reduced memory, but the bubble is still there. Can we shrink the bubble itself?

The idea: instead of assigning consecutive layers to each GPU, assign interleaved layers. With 8 layers and 2 GPUs:

GPU	Naive (contiguous)	Interleaved (v=2 chunks)
GPU 0	Layers 1–4	Layers 1, 2, 5, 6
GPU 1	Layers 5–8	Layers 3, 4, 7, 8

Each GPU now holds v "model chunks" (v = number of stages per GPU). A micro-batch loops through the pipeline v times, visiting each GPU v times instead of once.
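The interleaved column of the table can be generated with a round-robin assignment of model chunks to GPUs. This is a minimal sketch, assuming the layers divide evenly into `num_gpus * v` chunks; `interleaved_layers` is a hypothetical helper name.

```python
def interleaved_layers(num_layers: int, num_gpus: int, v: int) -> list[list[int]]:
    """Return the 1-based layer indices held by each GPU with v chunks per GPU."""
    num_chunks = num_gpus * v
    if num_layers % num_chunks != 0:
        # Assumption of this sketch: equal-sized chunks only.
        raise ValueError("this sketch assumes layers divide evenly into chunks")
    chunk_size = num_layers // num_chunks
    assignment: list[list[int]] = [[] for _ in range(num_gpus)]
    for chunk in range(num_chunks):
        gpu = chunk % num_gpus  # chunks are dealt out round-robin
        first = chunk * chunk_size + 1
        assignment[gpu].extend(range(first, first + chunk_size))
    return assignment


print(interleaved_layers(num_layers=8, num_gpus=2, v=1))  # contiguous split
print(interleaved_layers(num_layers=8, num_gpus=2, v=2))  # interleaved split
```

With v = 1 the function falls back to the contiguous split, so the same code covers both columns of the table.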

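The bubble's idle time shrinks by a factor of v, which gives:

bubble_fraction = (PP − 1) / (v · M + PP − 1)

With v=2 stages per GPU, the idle time is halved compared to basic 1F1B. For PP=4 and M=8, the bubble fraction drops from about 27% (3/11) to about 16% (3/19).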

But there is a cost: communication increases by a factor of v as well. Each micro-batch traverses the pipeline v times, so there are v times more activation transfers between GPUs. This is a direct trade-off between bubble size and communication volume.
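The trade-off can be tabulated side by side. The sketch below evaluates the interleaved bubble formula for a few values of v and pairs each with the relative communication, which is taken directly from the factor-of-v statement above rather than measured. The function name is hypothetical.

```python
from fractions import Fraction


def interleaved_bubble_fraction(pp: int, m: int, v: int) -> Fraction:
    """Bubble fraction with v model chunks per GPU (formula from the text)."""
    return Fraction(pp - 1, v * m + pp - 1)


pp, m = 4, 8
for v in (1, 2, 4):
    frac = interleaved_bubble_fraction(pp, m, v)
    # Relative communication follows the text's "factor of v" claim.
    print(f"v={v}: bubble {float(frac):.1%}, activation transfers x{v}")
```

Each doubling of v buys a smaller bubble at the price of twice as many activation transfers, so the best v depends on how fast the interconnect is.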

Scheduling becomes more complex too. At each time step, a GPU must decide: should it process an earlier micro-batch through later layers (depth-first: finish batches quickly) or a later micro-batch through earlier layers (breadth-first: fill the pipeline faster)? This choice matters for memory and throughput.

Llama 3.1's approach: Meta's Llama 3.1 used 1F1B with interleaved stages and a tunable depth-first vs. breadth-first priority. This gave them fine control over the bubble-vs-communication trade-off.

Check: What is the trade-off of interleaved stages?

Smaller bubble, but more communication (each micro-batch visits each GPU multiple times)
Larger bubble, but less memory
Same bubble, but simpler code

Chapter 6: Zero Bubble & DualPipe

Can we eliminate the bubble entirely? Recent work by Sea AI Lab and DeepSeek suggests we can get close.

The key observation: the backward pass through a matrix multiplication is actually two separate operations:

Operation	Symbol	Purpose	When needed?
Input backward	B	Compute gradient for inputs (needed by earlier layers)	Immediately — on the critical path
Weight backward	W	Compute gradient for weights (needed by optimizer)	Anytime before the optimizer step

Key insight: The weight gradient (W) is not on the critical path. It can be computed at any time between the input backward (B) and the optimizer step. This lets us use W operations to fill the bubble.
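For a single linear layer Y = X · W, the two operations are two independent matrix products. The sketch below uses plain Python lists instead of a tensor library so that it stays self-contained; the helper names and the tiny example values are illustrative only.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: list[list[float]]) -> list[list[float]]:
    return [list(col) for col in zip(*a)]


def input_backward(grad_out, weight):
    """B: gradient w.r.t. the layer input; the previous stage is waiting on it."""
    return matmul(grad_out, transpose(weight))


def weight_backward(layer_input, grad_out):
    """W: gradient w.r.t. the weights; only the optimizer step needs it."""
    return matmul(transpose(layer_input), grad_out)


x = [[1.0, 2.0]]                          # input, shape 1x2
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]    # weights, shape 2x3
grad_y = [[1.0, 1.0, 1.0]]                # upstream gradient, shape 1x3

grad_x = input_backward(grad_y, w)        # run immediately (critical path)
deferred = (x, grad_y)                    # saved so W can run later
# ... other pipeline work could be scheduled here ...
grad_w = weight_backward(*deferred)       # run any time before the optimizer step

print(grad_x)
print(grad_w)
```

Deferring W means keeping the layer input and the upstream gradient alive a little longer, which is the memory cost that zero-bubble schedules trade against idle time.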

The ZB-H2 schedule exploits this by scheduling W operations into the idle slots of the pipeline, achieving a theoretically zero-bubble schedule. Finding the optimal placement requires solving an Integer Linear Programming (ILP) problem to minimize idle time.

DeepSeek's DualPipe extends this further: two streams of micro-batches propagate from both ends of the pipeline simultaneously, and their forward/backward operations are interleaved to maximize GPU utilization. DeepSeek reported "near-zero all-to-all communication overhead" for their V3/R1 models using this approach.

Complexity trade-off: Zero-bubble and DualPipe schedules are too complex for simple code snippets. They require fine-grained profiling of every operation and automatic scheduling. But the concepts are clear: decompose the backward pass, then schedule the flexible parts to fill the gaps.

Check: What is the key insight that enables zero-bubble pipeline schedules?

Using more GPUs
Using larger micro-batches
Splitting the backward into input grad (critical path) and weight grad (flexible), then scheduling W to fill idle slots

Chapter 7: Pipeline Schedule Simulator

Interactive widgets (not reproduced here):

Pipeline Schedule Simulator: animates how AFAB, 1F1B, and interleaved schedules fill GPUs over time as micro-batches flow through the pipeline. Gray blocks are the bubble (idle GPU time).

Bubble Fraction Calculator: sliders for PP stages, micro-batches (M), and interleave factor (v) show how each affects the bubble.
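A rough text-only stand-in for the AFAB view of the simulator can be written in a few lines. This sketch assumes forward and backward passes each take exactly one time slot (the bubble fraction does not depend on their ratio) and follows the reverse backward order from the AFAB chapter. The function names and the "--" idle marker are choices made for this sketch.

```python
def afab_timeline(pp: int, m: int) -> list[list[str]]:
    """Return one row per GPU; each cell is F<mb>, B<mb>, or "--" for idle."""
    total_slots = 2 * (m + pp - 1)
    grid = [["--"] * total_slots for _ in range(pp)]
    backward_start = m + pp - 1  # first slot after the last forward finishes
    for stage in range(pp):
        for i in range(m):
            # Forward of micro-batch i+1 reaches this stage after `stage` hops.
            grid[stage][stage + i] = f"F{i + 1}"
            # Backward starts at the last stage and runs micro-batches M..1.
            grid[stage][backward_start + (pp - 1 - stage) + i] = f"B{m - i}"
    return grid


def idle_fraction(grid: list[list[str]]) -> float:
    cells = [cell for row in grid for cell in row]
    return cells.count("--") / len(cells)


grid = afab_timeline(pp=4, m=4)
for stage, row in enumerate(grid):
    print(f"GPU {stage}: " + " ".join(row))
print(f"idle fraction: {idle_fraction(grid):.3f}")
```

Counting the idle cells reproduces (PP − 1) / (M + PP − 1), so the grid doubles as a check on the formula.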

Check: With PP=4, M=8 micro-batches, and no interleaving (v=1), what is the bubble fraction?

43%
~27%: (4-1)/(8+4-1) = 3/11
75%

Chapter 8: Summary

Schedule	Bubble	Memory	Complexity
Naive (no micro-batches)	(PP−1)/PP — catastrophic	Low	Trivial
AFAB	(PP−1)/(M+PP−1)	O(M) — high	Simple
1F1B	(PP−1)/(M+PP−1)	O(PP) — bounded	Moderate
Interleaved	(PP−1)/(v·M+PP−1)	O(PP)	Complex
Zero Bubble	≈ 0	O(PP)	Very complex

What comes next: We have seen how to split data (DP), weights along the hidden dimension (TP), sequences (CP), and layers (PP). One dimension remains: for MoE models, we can split experts across GPUs. That is expert parallelism, covered in Ch 7 of the playbook.

PP vs. ZeRO-3: Both partition the model across GPUs. PP communicates activations and requires careful scheduling to minimize the bubble. ZeRO-3 communicates weights and requires large batches to overlap communication with computation. In practice, they are rarely combined, but ZeRO-1/2 can be combined with PP easily (e.g., DeepSeek-V3 used PP + ZeRO-1).

Check: Does PP communicate weights or activations between stages?

Weights (like ZeRO-3)
Activations: each stage sends its output to the next stage
Gradients only