Tazi et al., Chapter 4

Tensor Parallelism

Split weight matrices across GPUs. Column & row sharding, sequence parallelism, and the critical-path communication trade-off.

Prerequisites: Chapter 3 (Data Parallelism). Understanding of matrix multiplication and all-reduce.
9 chapters, 3 simulations, 9 quizzes

Chapter 0: Beyond ZeRO

ZeRO-3 can shard model parameters, gradients, and optimizer states across GPUs. But it has a limitation: before computing each layer, it must all-gather the full layer weights onto every GPU. The activations during the matrix multiplication are still unsharded.

When activations become the memory bottleneck — for large models with long sequences — we need something different. We need to split the actual computation of a matrix multiplication across GPUs, so that no single GPU ever holds the full activation tensor.

Tensor parallelism in one sentence: Shard each weight matrix across GPUs so that each GPU computes only a slice of each matrix multiplication, keeping both weights and activations distributed.

This works because of a fundamental mathematical property: matrix multiplication can be decomposed along either the column or row dimension. If Y = XW, we can either split W by columns (and concatenate results) or split W by rows (and sum results). Let us see how.
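As a small worked case, here are the two decompositions for N = 2. The block notation is ours and is only meant to make the column-versus-row distinction concrete:

```latex
% Column split: W = [W_1  W_2], cut along the output dimension.
Y = X \begin{bmatrix} W_1 & W_2 \end{bmatrix}
  = \begin{bmatrix} X W_1 & X W_2 \end{bmatrix}

% Row split: W = [W_1; W_2], cut along the input dimension,
% with X = [X_1  X_2] cut along its feature dimension to match.
Y = \begin{bmatrix} X_1 & X_2 \end{bmatrix}
    \begin{bmatrix} W_1 \\ W_2 \end{bmatrix}
  = X_1 W_1 + X_2 W_2
```

In the column case the partial products sit side by side, so they are concatenated. In the row case every partial product has the full output shape, so they are added.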

Check: Why might ZeRO-3 not be enough for very large models?

Chapter 1: Column Sharding

Start with a weight matrix W of shape (in_features, out_features). Column-parallel sharding splits W along the output dimension into N shards: W = [W1, W2, ..., WN].

Each GPU receives the full input X (via broadcast) and one column shard Wi. It computes Yi = X · Wi. The result Yi is a partial output. To reconstruct the full result, we all-gather the partial outputs: Y = [Y1, Y2, ..., YN].

1. Broadcast X: the full input is replicated on every GPU.
2. Local matmul, Yi = X · Wi: each GPU computes with its column shard.
3. All-gather: concatenate the partial results, Y = [Y1, ..., YN].
Key insight: Column sharding splits the output dimension. Each GPU produces a narrow slice of the output. This is perfect for layers where we want to split along independent output features.
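
The three steps can be sketched on a single machine with NumPy, using list entries to stand in for GPUs. This is a minimal simulation under assumed shapes (4 "GPUs", 8 input features, 12 output features); the collectives are replaced by plain array operations, so it checks the math, not the communication.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 4                         # hypothetical TP degree
X = rng.standard_normal((3, 8))    # (tokens, in_features), replicated on every "GPU"
W = rng.standard_normal((8, 12))   # (in_features, out_features)

# Shard W along the output (column) dimension: one shard per simulated GPU.
W_shards = np.split(W, n_gpus, axis=1)

# Local matmul on each GPU: Y_i = X @ W_i, shape (tokens, out_features / n_gpus).
Y_parts = [X @ W_i for W_i in W_shards]

# Stand-in for the all-gather: concatenate the slices along the output dimension.
Y_column = np.concatenate(Y_parts, axis=1)

column_matches = np.allclose(Y_column, X @ W)
print(column_matches)
```

Each entry of `Y_parts` is a narrow slice of the output, which is exactly the slice one GPU would hold before the all-gather.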
Check: In column-parallel sharding, what operation combines the partial results?

Chapter 2: Row Sharding

Now split W along the input dimension (rows): W = [W1; W2; ...; WN]. The input must be split to match, along its feature (column) dimension: X = [X1, X2, ..., XN], distributed via scatter.

Each GPU computes Yi = Xi · Wi. Every partial result Yi already has the full output shape, but the partial results must be summed (not concatenated) to obtain the final result. This requires an all-reduce.

1. Scatter X: split the input across GPUs, so each gets Xi.
2. Local matmul, Yi = Xi · Wi: each GPU computes with its row shard.
3. All-reduce: sum the partial results, Y = ∑ Yi.
Column vs. Row: Column sharding needs broadcast (input) + all-gather (output). Row sharding needs scatter (input) + all-reduce (output). The choice depends on what comes before and after in the network.
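
A matching sketch for row sharding, again simulated with NumPy on one machine and with assumed shapes. The input is cut along its feature dimension so that each slice lines up with one row shard of W, and the all-reduce is replaced by an elementwise sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 4                         # hypothetical TP degree
X = rng.standard_normal((3, 8))    # (tokens, in_features)
W = rng.standard_normal((8, 12))   # (in_features, out_features)

# Stand-in for the scatter: split X along its feature dimension, W along its rows.
X_shards = np.split(X, n_gpus, axis=1)   # each (3, 2)
W_shards = np.split(W, n_gpus, axis=0)   # each (2, 12)

# Local matmul: every partial result already has the full output shape (3, 12).
Y_parts = [X_i @ W_i for X_i, W_i in zip(X_shards, W_shards)]

# Stand-in for the all-reduce: elementwise sum over the simulated GPUs.
Y_row = np.sum(Y_parts, axis=0)

row_matches = np.allclose(Y_row, X @ W)
print(row_matches)
```

Note that no single entry of `Y_parts` is meaningful on its own; only the sum is the layer output.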
Check: Row-parallel sharding combines partial results using...

Chapter 3: Inside Transformer Blocks

A transformer layer has two main blocks: the MLP (feedforward) and the multi-head attention (MHA). We can apply TP to both by cleverly pairing column and row sharding.

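MLP block: The first linear layer (FC1) uses column sharding; the second (FC2) uses row sharding. The output slice that FC1 produces on each GPU is exactly the input slice that FC2's row shard expects, and the elementwise activation between them acts on each slice independently, so no communication is needed between the two layers. This means we only need one all-reduce per MLP block in the forward pass (at the output of FC2). The input broadcast for FC1 is free since inputs are already synced.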
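
The column-then-row pairing can be checked numerically. The sketch below is a single-process NumPy simulation; the GeLU activation, the shapes, and the TP degree of 4 are assumptions made for illustration, and the only "communication" is the final sum.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU; any elementwise activation behaves the same way here
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
n_gpus, hidden, ffn = 4, 8, 32              # hypothetical sizes
X = rng.standard_normal((5, hidden))        # (tokens, hidden), replicated
W1 = rng.standard_normal((hidden, ffn))     # FC1 weights
W2 = rng.standard_normal((ffn, hidden))     # FC2 weights

W1_shards = np.split(W1, n_gpus, axis=1)    # FC1: column-parallel
W2_shards = np.split(W2, n_gpus, axis=0)    # FC2: row-parallel

partials = []
for W1_i, W2_i in zip(W1_shards, W2_shards):
    H_i = gelu(X @ W1_i)          # (tokens, ffn / n_gpus); stays local, no communication
    partials.append(H_i @ W2_i)   # (tokens, hidden); a partial sum

Y_tp = np.sum(partials, axis=0)   # stand-in for the single all-reduce
Y_ref = gelu(X @ W1) @ W2         # unsharded reference

mlp_matches = np.allclose(Y_tp, Y_ref)
print(mlp_matches)
```

The intermediate `H_i` is never gathered, which is why the hidden-layer activations stay sharded for the whole block.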

Attention block: The Q, K, V projections are column-parallel (each GPU handles a subset of attention heads). The output projection is row-parallel. Again, one all-reduce per attention block.

Natural parallelism: Multi-head attention is designed to be parallelized — each head operates independently. Column-parallel TP just assigns different heads to different GPUs. For GQA (grouped query attention), the TP degree should not exceed the number of KV heads, or you need careful synchronization.
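
As an illustration of heads being assigned to GPUs, the sketch below simulates head-parallel attention in NumPy. It assumes plain multi-head attention without masking, biases, or GQA, a head-major column layout for the Q/K/V weights, and hypothetical sizes; `attention_heads` is a helper written for this example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_heads(X, Wq, Wk, Wv, n_heads):
    """Run n_heads unmasked attention heads; return their concatenated outputs."""
    seq = X.shape[0]
    # (seq, n_heads * d) -> (n_heads, seq, d), assuming head-major columns
    split = lambda M: (X @ M).reshape(seq, n_heads, -1).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    d = Q.shape[-1]
    probs = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d))   # (n_heads, seq, seq)
    out = probs @ V                                          # (n_heads, seq, d)
    return out.transpose(1, 0, 2).reshape(seq, -1)           # (seq, n_heads * d)

rng = np.random.default_rng(0)
n_gpus, n_heads, head_dim, hidden, seq = 2, 4, 3, 8, 5       # hypothetical sizes
X = rng.standard_normal((seq, hidden))
Wq, Wk, Wv = (rng.standard_normal((hidden, n_heads * head_dim)) for _ in range(3))
Wo = rng.standard_normal((n_heads * head_dim, hidden))

Y_ref = attention_heads(X, Wq, Wk, Wv, n_heads) @ Wo         # unsharded reference

heads_per_gpu = n_heads // n_gpus
partials = []
for Wq_i, Wk_i, Wv_i, Wo_i in zip(
    np.split(Wq, n_gpus, axis=1),   # Q/K/V projections: column-parallel
    np.split(Wk, n_gpus, axis=1),
    np.split(Wv, n_gpus, axis=1),
    np.split(Wo, n_gpus, axis=0),   # output projection: row-parallel
):
    local = attention_heads(X, Wq_i, Wk_i, Wv_i, heads_per_gpu)  # this GPU's heads only
    partials.append(local @ Wo_i)                                # partial sum

Y_tp = np.sum(partials, axis=0)     # stand-in for the single all-reduce
attention_matches = np.allclose(Y_tp, Y_ref)
print(attention_matches)
```

Because each head's softmax only involves that head's own Q, K, and V, nothing has to be exchanged until the output projection.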
[Interactive figure: Tensor Parallelism in a Transformer Layer. Shows how column-linear and row-linear layers combine inside the MLP and attention blocks.]

Check: In the TP layout for a transformer, how many all-reduce operations happen per layer in the forward pass?

Chapter 4: Communication Cost

Here is the critical difference between TP and data parallelism: in DP, communication (all-reduce gradients) happens between layers and can be overlapped with backward computation. In TP, the all-reduce happens within each layer, on the critical path of the forward pass.

The TP trade-off: TP communication is in the critical path — the forward pass cannot continue until the all-reduce completes. This means TP performance depends heavily on communication bandwidth. Fast intra-node NVLink (~900 GB/s) is fine; slow inter-node InfiniBand (~100 GB/s) kills performance.
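
To get a feel for the numbers, here is a rough bandwidth-only estimate of one all-reduce. It assumes a ring all-reduce (each GPU moves about 2(N-1)/N times the tensor size), ignores latency and any overlap, and uses a hypothetical activation of 4096 tokens by 8192 hidden features in bf16. The two bandwidth figures are the ones quoted above.

```python
def ring_all_reduce_seconds(num_bytes, n_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only estimate for a ring all-reduce (latency ignored)."""
    traffic = 2 * (n_gpus - 1) / n_gpus * num_bytes   # bytes sent per GPU
    return traffic / bandwidth_bytes_per_s

seq, hidden, bytes_per_elem = 4096, 8192, 2   # hypothetical shape, bf16
activation_bytes = seq * hidden * bytes_per_elem

nvlink_bw = 900e9       # ~900 GB/s, figure from the text
infiniband_bw = 100e9   # ~100 GB/s, figure from the text

t_nvlink = ring_all_reduce_seconds(activation_bytes, 8, nvlink_bw)
t_infiniband = ring_all_reduce_seconds(activation_bytes, 8, infiniband_bw)

print(f"NVLink:     {t_nvlink * 1e6:.0f} us per all-reduce")
print(f"InfiniBand: {t_infiniband * 1e6:.0f} us per all-reduce")
print(f"Slowdown:   {t_infiniband / t_nvlink:.1f}x")
```

Under these assumptions the slowdown is simply the bandwidth ratio, and it is paid at every all-reduce in every layer, with the forward pass stalled each time.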

Benchmarks show significant throughput drops when scaling TP beyond 8 GPUs (the typical node size):

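TP degree | Relative throughput | Communication
----------|---------------------|-----------------------------------
TP=1      | 100% (baseline)     | None
TP=4      | ~90%                | Intra-node NVLink
TP=8      | ~80%                | Intra-node NVLink
TP=16     | ~45%                | Crosses node boundary (InfiniBand)
TP=32     | ~25%                | Multi-node InfiniBand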
Rule of thumb: Keep TP within a single node (TP ≤ 8 for most clusters). Use pipeline parallelism for cross-node distribution instead.
Check: Why does TP performance drop sharply at TP=16?

Chapter 5: Sequence Parallelism

TP shards the activations along the hidden dimension for the MLP and attention computations. But LayerNorm needs the full hidden dimension to compute correctly (it computes the mean and variance across all hidden features), and in vanilla TP both LayerNorm and dropout operate on full, unsharded activations.

This means that after the TP region we must gather the full activations for LayerNorm, partially negating the memory savings.

Sequence parallelism (SP) solves this by splitting activations along the sequence dimension for operations outside the TP region. Since LayerNorm operates independently on each token, splitting tokens across GPUs works perfectly.
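
A quick NumPy check of this claim, under assumed shapes and with a LayerNorm that omits the learned scale and shift: splitting by tokens reproduces the full result, while splitting by hidden features does not.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalizes each token (row) over the hidden dimension; no learned scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
n_gpus, seq, hidden = 4, 8, 16          # hypothetical sizes
X = rng.standard_normal((seq, hidden))
reference = layer_norm(X)

# Sequence split: each simulated GPU normalizes its own tokens.
seq_shards = np.split(X, n_gpus, axis=0)
seq_result = np.concatenate([layer_norm(s) for s in seq_shards], axis=0)
seq_split_ok = np.allclose(seq_result, reference)

# Hidden split: each simulated GPU sees only part of every token's features.
hidden_shards = np.split(X, n_gpus, axis=1)
hidden_result = np.concatenate([layer_norm(s) for s in hidden_shards], axis=1)
hidden_split_ok = np.allclose(hidden_result, reference)

print(seq_split_ok, hidden_split_ok)
```

The hidden split fails because each shard computes its own mean and variance from a quarter of the features, which is not the statistic LayerNorm is defined over.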

SP region (LayerNorm, dropout): activations split along sequence dim, [seq/N, hidden]
↓ all-gather (transition to TP)
TP region (attention, MLP): activations split along hidden dim, [seq, hidden/N]
↓ reduce-scatter (transition to SP)
SP region (LayerNorm, dropout): activations split along sequence dim, [seq/N, hidden]
SP does not add communication cost: Vanilla TP uses two all-reduce ops per transformer block. TP+SP uses two all-gathers and two reduce-scatters. Since all-reduce = all-gather + reduce-scatter, the total communication volume is the same. But now the maximum activation size per GPU drops from [seq, hidden] to max([seq/N, hidden], [seq, hidden/N]).
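
The identity behind this argument can be sketched with NumPy, treating list entries as GPUs. The shapes are assumptions, and the collectives are emulated with sums, splits, and concatenations rather than real communication calls.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus, seq, hidden = 4, 8, 6   # hypothetical sizes

# One partial-sum tensor per simulated GPU, as produced by a row-parallel matmul.
partials = [rng.standard_normal((seq, hidden)) for _ in range(n_gpus)]

# All-reduce: every GPU ends up holding the full [seq, hidden] sum.
all_reduced = np.sum(partials, axis=0)

# Reduce-scatter along the sequence dimension:
# GPU i ends up with only the summed rows of chunk i, shape [seq / n_gpus, hidden].
chunks = [np.split(p, n_gpus, axis=0) for p in partials]
reduce_scattered = [
    sum(chunks[gpu][i] for gpu in range(n_gpus)) for i in range(n_gpus)
]

# All-gather: concatenating the chunks rebuilds the full tensor.
gathered = np.concatenate(reduce_scattered, axis=0)

same_result = np.allclose(gathered, all_reduced)
shard_shape = reduce_scattered[0].shape
print(same_result, shard_shape)
```

With SP, the LayerNorm and dropout work happens between the two halves, on the `[seq/N, hidden]` chunks held in `reduce_scattered`, instead of on the full tensor.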
Check: What dimension does sequence parallelism split along?

Chapter 6: TP+SP in Practice

Let us track exactly what happens to the activation shape as data flows through a transformer layer with TP+SP:

Location                | TP only                | TP + SP
------------------------|------------------------|----------------------------------------
Enter column-linear     | h: sharded, s: full    | h: sharded, s: all-gather to full
TP region               | h: sharded, s: full    | h: sharded, s: full
Exit row-linear         | h: full (all-reduce)   | h: full (reduce-scatter to s: sharded)
SP region (LN, dropout) | h: full, s: full       | h: full, s: sharded

The maximum activation tensor on any GPU is now [seq/N, hidden] rather than [seq, hidden]. For TP+SP=16, this allows fitting sequence lengths of 16K tokens that would be impossible with TP alone.
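
The saving is easy to quantify for a single activation tensor. The arithmetic below assumes a hidden size of 8192 and bf16 storage (2 bytes per element), neither of which is stated in the text, together with the 16K sequence length and N = 16 mentioned above.

```python
def activation_bytes(seq, hidden, bytes_per_elem=2):
    """Size of one [seq, hidden] activation tensor (bf16 by default)."""
    return seq * hidden * bytes_per_elem

seq, hidden, n = 16_384, 8_192, 16   # hidden size and dtype are assumptions

full = activation_bytes(seq, hidden)            # [seq, hidden]
sp_shard = activation_bytes(seq // n, hidden)   # [seq/N, hidden]
tp_shard = activation_bytes(seq, hidden // n)   # [seq, hidden/N]

mib = 2**20
print(full / mib, sp_shard / mib, tp_shard / mib)   # prints 256.0 16.0 16.0
```

This counts one tensor only. A real layer keeps several such activations alive for the backward pass, so the absolute numbers are larger, but the factor of N applies to each of them.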

Embedding layer: The embedding layer is also row-parallel (sharded on vocabulary). With SP, the output gets reduce-scattered along the sequence dimension, consistent with the rest of the SP regions.
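
A vocabulary-sharded lookup can be sketched as follows. This is a NumPy simulation under assumed sizes, not the code of any particular library: each simulated GPU owns a contiguous range of vocabulary rows, returns zeros for tokens outside its range, and the partial outputs are summed and then split along the sequence dimension to emulate the reduce-scatter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus, vocab, hidden = 4, 12, 5                  # hypothetical sizes
E = rng.standard_normal((vocab, hidden))          # full embedding table
token_ids = np.array([0, 3, 7, 11, 4, 4, 9, 2])   # seq = 8

rows_per_gpu = vocab // n_gpus
partials = []
for rank in range(n_gpus):
    start = rank * rows_per_gpu
    E_local = E[start:start + rows_per_gpu]       # this GPU's vocabulary rows
    in_range = (token_ids >= start) & (token_ids < start + rows_per_gpu)
    local_ids = np.where(in_range, token_ids - start, 0)   # clamp foreign ids to row 0
    out = E_local[local_ids]                      # fancy indexing returns a copy
    out[~in_range] = 0.0                          # tokens owned by other GPUs contribute zero
    partials.append(out)

# Stand-in for the reduce-scatter: sum, then keep one sequence chunk per GPU.
summed = np.sum(partials, axis=0)
seq_shards = np.split(summed, n_gpus, axis=0)     # each [seq / n_gpus, hidden]

embedding_ok = np.allclose(np.concatenate(seq_shards, axis=0), E[token_ids])
print(embedding_ok, seq_shards[0].shape)
```

Each token is owned by exactly one GPU, so the sum has a single nonzero contribution per row, which is the row-parallel pattern with a one-hot input.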

Benchmarks confirm: TP+SP enables significantly larger batch sizes per GPU through activation memory savings, with the same communication cost as vanilla TP. The performance drop beyond TP=8 (crossing node boundaries) remains the same limiting factor.

Check: Does TP+SP increase total communication volume compared to vanilla TP?

Chapter 7: TP Scaling Simulator

Explore the trade-off between TP degree, throughput, and memory. Higher TP reduces per-GPU memory but adds communication overhead.
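
For readers without access to the interactive version, a very rough stand-in is sketched below. The throughput figures are copied from the Chapter 4 table; the memory estimate is an idealization that assumes every weight is evenly sharded and stored in bf16, and it ignores gradients, optimizer states, activations, and replicated parameters. The function name is hypothetical.

```python
# Approximate relative throughput per TP degree, copied from the Chapter 4 table.
RELATIVE_THROUGHPUT = {1: 1.00, 4: 0.90, 8: 0.80, 16: 0.45, 32: 0.25}

def weight_gib_per_gpu(params_billions, tp_degree, bytes_per_param=2):
    """Weight memory per GPU if all weights were evenly sharded (an idealization)."""
    return params_billions * 1e9 * bytes_per_param / tp_degree / 2**30

model_b_params = 7   # example model size in billions of parameters
for tp, throughput in RELATIVE_THROUGHPUT.items():
    mem = weight_gib_per_gpu(model_b_params, tp)
    print(f"TP={tp:<2}  weights/GPU ~ {mem:5.2f} GiB  relative throughput ~ {throughput:.0%}")
```

The memory column halves with every doubling of TP, while the throughput column falls most steeply between TP=8 and TP=16.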

[Interactive simulation: Tensor Parallelism Explorer. Adjust the TP degree and model size and observe the throughput/memory trade-off.]
Check: At which TP degree does the biggest performance cliff typically appear?

Chapter 8: Summary

Technique | What it shards                       | Communication               | Best for
----------|--------------------------------------|-----------------------------|---------------------------------------
Column TP | Weights along output dim             | Broadcast + all-gather      | First linear in MLP, Q/K/V projections
Row TP    | Weights along input dim              | Scatter + all-reduce        | Second linear in MLP, output projection
TP+SP     | Weights + activations (hidden & seq) | All-gather + reduce-scatter | Maximum activation memory savings
What comes next: TP+SP splits activations along the hidden dimension inside the TP region and along the sequence dimension for LayerNorm and dropout. But inside the TP region, each GPU still processes the full sequence. For very long sequences (128K+ tokens), we need context parallelism, which splits the sequence for the attention computation itself.
Check: TP is best kept within a single node. What technique is better for cross-node model distribution?