Tazi et al., Chapter 4

Tensor Parallelism

Split weight matrices across GPUs. Column & row sharding, sequence parallelism, and the critical-path communication trade-off.

Prerequisites: Chapter 3 (Data Parallelism). Understanding of matrix multiplication and all-reduce.

Chapter 0: Beyond ZeRO

ZeRO-3 shards model parameters, gradients, and optimizer states across GPUs. But it has a limitation: before computing each layer, it must all-gather that layer's full weights onto every GPU. Each GPU then runs the full matrix multiplication, so the activations are still unsharded.

When activations become the memory bottleneck — for large models with long sequences — we need something different. We need to split the actual computation of a matrix multiplication across GPUs, so that no single GPU ever holds the full activation tensor.

Tensor parallelism in one sentence: Shard each weight matrix across GPUs so that each GPU computes only a slice of each matrix multiplication, keeping both weights and activations distributed.

This works because of a fundamental mathematical property: matrix multiplication can be decomposed along either the column or row dimension. If Y = XW, we can either split W by columns (and concatenate results) or split W by rows (and sum results). Let us see how.
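Written out for two shards (N = 2 is an illustrative choice; the same identities hold for any number of equal blocks), the two decompositions look like this:

```latex
% Column split: W is cut into column blocks, results are concatenated.
XW = X \begin{bmatrix} W_1 & W_2 \end{bmatrix}
   = \begin{bmatrix} XW_1 & XW_2 \end{bmatrix}

% Row split: W is cut into row blocks, X into matching column blocks,
% and the partial products are summed.
XW = \begin{bmatrix} X_1 & X_2 \end{bmatrix}
     \begin{bmatrix} W_1 \\ W_2 \end{bmatrix}
   = X_1 W_1 + X_2 W_2
```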

Check: Why might ZeRO-3 not be enough for very large models?

Even though weights are sharded, activations are still full-sized during computation
ZeRO-3 does not support mixed precision
ZeRO-3 cannot be used with more than 8 GPUs

Chapter 1: Column Sharding

Start with a weight matrix W of shape (in_features, out_features). Column-parallel sharding splits W along the output dimension into N shards: W = [W₁, W₂, ..., W_N].

Each GPU receives the full input X (via broadcast) and one column shard W_i. It computes Y_i = X · W_i. The result Y_i is a partial output. To reconstruct the full result, we all-gather the partial outputs: Y = [Y₁, Y₂, ..., Y_N].

Broadcast X

Full input replicated on every GPU

↓

Local matmul: Y_i = X · W_i

Each GPU computes with its column shard

↓

All-gather

Concatenate partial results Y = [Y₁, ..., Y_N]

Key insight: Column sharding splits the output dimension. Each GPU produces a narrow slice of the output. This is perfect for layers where we want to split along independent output features.
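The three steps above can be simulated on a single machine with NumPy. This is a minimal sketch, not a distributed implementation: the "GPUs" are just list entries, the shapes and `n_gpus` are made-up illustrative values, and the all-gather is modeled as a plain concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 4                          # hypothetical TP degree
X = rng.standard_normal((3, 8))     # (tokens, in_features), replicated on every "GPU"
W = rng.standard_normal((8, 12))    # (in_features, out_features)

# Split W along the output (column) dimension: one shard per simulated GPU.
W_shards = np.split(W, n_gpus, axis=1)

# Each "GPU" multiplies the full input by its own column shard.
Y_partials = [X @ W_i for W_i in W_shards]    # each has shape (3, 12 / n_gpus)

# All-gather, modeled as concatenation along the output dimension.
Y = np.concatenate(Y_partials, axis=1)

print(np.allclose(Y, X @ W))
```

Each partial output has shape (3, 3) here, a quarter of the full output width, which is where the activation saving comes from.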

Check: In column-parallel sharding, what operation combines the partial results?

All-reduce (sum)
All-gather (concatenate)
Reduce-scatter

Chapter 2: Row Sharding

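Now split W along the input dimension (rows): W = [W₁; W₂; ...; W_N]. This also requires splitting the input along its feature (column) dimension, X = [X₁, X₂, ..., X_N], via scatter, so that the columns of X_i line up with the rows of W_i.

Each GPU computes Y_i = X_i · W_i. Each partial result already has the full output shape, but it holds only part of the sum, so the partial results must be summed (not concatenated) to get the final result. This requires an all-reduce.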

Scatter X

Split input across GPUs: each gets X_i

↓

Local matmul: Y_i = X_i · W_i

Each GPU computes with its row shard

↓

All-reduce

Sum partial results: Y = ∑ Y_i

Column vs. Row: Column sharding needs broadcast (input) + all-gather (output). Row sharding needs scatter (input) + all-reduce (output). The choice depends on what comes before and after in the network.
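The row-parallel path can be checked the same way. Again this is a single-process NumPy sketch with illustrative shapes; the scatter is modeled as a split of X along its feature dimension and the all-reduce as an elementwise sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 4                          # hypothetical TP degree
X = rng.standard_normal((3, 8))     # (tokens, in_features)
W = rng.standard_normal((8, 12))    # (in_features, out_features)

# Scatter: split X along its feature dimension, W along its rows.
X_shards = np.split(X, n_gpus, axis=1)    # each (3, 2)
W_shards = np.split(W, n_gpus, axis=0)    # each (2, 12)

# Each "GPU" produces a full-shaped output that holds only part of the sum.
Y_partials = [X_i @ W_i for X_i, W_i in zip(X_shards, W_shards)]

# All-reduce, modeled as a sum over the simulated GPUs.
Y = np.sum(Y_partials, axis=0)

print(np.allclose(Y, X @ W))
```

Unlike the column case, every partial result here already has the full shape (3, 12); only the summation is missing.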

Check: Row-parallel sharding combines partial results using...

All-reduce (summing across GPUs)
All-gather (concatenation)
No communication needed

Chapter 3: Inside Transformer Blocks

A transformer layer has two main blocks: the MLP (feedforward) and the multi-head attention (MHA). We can apply TP to both by cleverly pairing column and row sharding.

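MLP block: The first linear layer (FC1) uses column sharding; the second (FC2) uses row sharding. The activation function between them is elementwise, so each GPU can apply it to its own slice of FC1's output, and that slice is exactly the input shard that FC2's row shard expects. No communication is needed between the two layers, so the forward pass needs only one all-reduce per MLP block (at the output of FC2). The input broadcast for FC1 is free since the inputs are already replicated on every GPU.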
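A small NumPy simulation makes the pairing concrete. It is a sketch under assumptions: the tanh approximation of GeLU stands in for whatever activation the model uses, biases are omitted, and the sizes and `n_gpus` are arbitrary.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU; any elementwise activation behaves the same way here
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
n_gpus = 2                            # hypothetical TP degree
X = rng.standard_normal((4, 8))       # (tokens, hidden), replicated on every "GPU"
W1 = rng.standard_normal((8, 32))     # FC1: hidden -> 4 * hidden
W2 = rng.standard_normal((32, 8))     # FC2: 4 * hidden -> hidden

W1_shards = np.split(W1, n_gpus, axis=1)   # column-parallel
W2_shards = np.split(W2, n_gpus, axis=0)   # row-parallel

partials = []
for W1_i, W2_i in zip(W1_shards, W2_shards):
    H_i = gelu(X @ W1_i)          # local slice of the intermediate activations, (4, 16)
    partials.append(H_i @ W2_i)   # full-shaped partial sum, (4, 8)

Y = np.sum(partials, axis=0)      # the single all-reduce of the block

reference = gelu(X @ W1) @ W2     # unsharded computation
print(np.allclose(Y, reference))
```

The loop body never needs another GPU's data; the only cross-GPU step is the final sum.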

Attention block: The Q, K, V projections are column-parallel (each GPU handles a subset of attention heads). The output projection is row-parallel. Again, one all-reduce per attention block.

Natural parallelism: Multi-head attention lends itself to parallelization because each head operates independently. Column-parallel TP just assigns different heads to different GPUs. For GQA (grouped query attention), the TP degree should not exceed the number of KV heads; otherwise several GPUs need the same KV head, which requires careful synchronization.
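The head-wise split can be sketched in NumPy as follows. The helper `attention_heads` is hypothetical and written only for this illustration; it omits masking, biases, and batching, and assumes the projection columns are laid out head by head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_heads(X, Wq, Wk, Wv, head_dim):
    """Run all heads contained in the given projection slices (no mask)."""
    s = X.shape[0]
    n_heads = Wq.shape[1] // head_dim
    # (seq, n_heads * head_dim) -> (n_heads, seq, head_dim)
    q = (X @ Wq).reshape(s, n_heads, head_dim).transpose(1, 0, 2)
    k = (X @ Wk).reshape(s, n_heads, head_dim).transpose(1, 0, 2)
    v = (X @ Wv).reshape(s, n_heads, head_dim).transpose(1, 0, 2)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    out = weights @ v                               # (n_heads, seq, head_dim)
    return out.transpose(1, 0, 2).reshape(s, n_heads * head_dim)

rng = np.random.default_rng(0)
seq, hidden, head_dim, n_gpus = 5, 8, 2, 2          # 4 heads, 2 per simulated GPU
X = rng.standard_normal((seq, hidden))
Wq, Wk, Wv, Wo = (rng.standard_normal((hidden, hidden)) for _ in range(4))

reference = attention_heads(X, Wq, Wk, Wv, head_dim) @ Wo

# Q/K/V projections are column-parallel (whole heads per GPU); Wo is row-parallel.
partials = []
for Wq_i, Wk_i, Wv_i, Wo_i in zip(np.split(Wq, n_gpus, axis=1),
                                  np.split(Wk, n_gpus, axis=1),
                                  np.split(Wv, n_gpus, axis=1),
                                  np.split(Wo, n_gpus, axis=0)):
    local_heads = attention_heads(X, Wq_i, Wk_i, Wv_i, head_dim)   # (seq, hidden / n_gpus)
    partials.append(local_heads @ Wo_i)                            # (seq, hidden) partial sum

Y = np.sum(partials, axis=0)     # the single all-reduce of the attention block
print(np.allclose(Y, reference))
```

The split only works cleanly because each column shard contains whole heads, which is why the number of heads constrains the usable TP degree.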

Figure: Tensor Parallelism in a Transformer Layer. Shows how column-linear and row-linear layers combine inside the MLP and attention blocks.

Check: In the TP layout for a transformer, how many all-reduce operations happen per layer in the forward pass?

One (only after MLP)
Two (one after Attention, one after MLP)
Four (two per block)

Chapter 4: Communication Cost

Here is the critical difference between TP and data parallelism: in DP, communication (all-reduce gradients) happens between layers and can be overlapped with backward computation. In TP, the all-reduce happens within each layer, on the critical path of the forward pass.

The TP trade-off: TP communication is on the critical path, so the forward pass cannot continue until the all-reduce completes. TP performance therefore depends heavily on communication bandwidth. Fast intra-node NVLink (~900 GB/s) keeps the cost manageable; slower inter-node InfiniBand (~100 GB/s) causes a sharp drop in throughput.
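A back-of-envelope estimate shows how directly the bandwidth figures translate into stall time. The sketch below assumes a bandwidth-bound ring all-reduce, in which each GPU transfers 2(N-1)/N times the message size, and ignores latency, protocol overhead, and any overlap. The function name, the tensor sizes, and the bf16 element size are illustrative assumptions; the two bandwidths are the approximate figures quoted above.

```python
def ring_all_reduce_seconds(message_bytes, n_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only estimate for a ring all-reduce (latency ignored)."""
    if n_gpus == 1:
        return 0.0
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic_per_gpu / bandwidth_bytes_per_s

# Hypothetical activation tensor: batch 1, 4096 tokens, hidden size 8192, bf16.
batch, seq, hidden, bytes_per_elem = 1, 4096, 8192, 2
message_bytes = batch * seq * hidden * bytes_per_elem

t_nvlink = ring_all_reduce_seconds(message_bytes, 8, 900e9)   # ~900 GB/s
t_ib = ring_all_reduce_seconds(message_bytes, 8, 100e9)       # ~100 GB/s

print(f"NVLink estimate:     {t_nvlink * 1e6:.0f} microseconds per all-reduce")
print(f"InfiniBand estimate: {t_ib * 1e6:.0f} microseconds per all-reduce")
print(f"Slowdown factor:     {t_ib / t_nvlink:.1f}x")
```

Under this model the slowdown is simply the bandwidth ratio, and because the all-reduce sits on the critical path, it is paid twice per layer in the forward pass.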

Benchmarks show significant throughput drops when scaling TP beyond 8 GPUs (the typical node size):

TP Degree	Relative Throughput	Communication
TP=1	100% (baseline)	None
TP=4	~90%	Intra-node NVLink
TP=8	~80%	Intra-node NVLink
TP=16	~45%	Crosses node boundary (InfiniBand)
TP=32	~25%	Multi-node InfiniBand

Rule of thumb: Keep TP within a single node (TP ≤ 8 for most clusters). Use pipeline parallelism for cross-node distribution instead.

Check: Why does TP performance drop sharply at TP=16?

Too many parameters to shard
Communication crosses the node boundary to slower interconnect
The model has too few attention heads

Chapter 5: Sequence Parallelism

TP shards the activations along the hidden dimension inside the MLP and attention computations. But LayerNorm and dropout sit outside those computations. LayerNorm needs the full hidden dimension to compute correctly (it computes mean and variance across all hidden features), and in vanilla TP dropout likewise runs on the full, replicated activations.

This means after the TP region, every GPU must hold the full activations for LayerNorm and dropout, partially negating the memory savings.

Sequence parallelism (SP) solves this by splitting activations along the sequence dimension for operations outside the TP region. Since LayerNorm operates independently on each token, splitting tokens across GPUs works perfectly.
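The per-token independence is easy to verify numerically. The sketch below uses a bare LayerNorm without the learnable scale and shift (an assumption made for brevity) and compares a sequence split against a hidden split on made-up shapes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalizes each token (row) over the hidden dimension; no learnable affine.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
n_gpus = 4                            # hypothetical parallel degree
X = rng.standard_normal((8, 16))      # (seq, hidden)
reference = layer_norm(X)

# Sequence split: each "GPU" normalizes its own tokens, no communication needed.
seq_shards = np.split(X, n_gpus, axis=0)
out_seq = np.concatenate([layer_norm(s) for s in seq_shards], axis=0)

# Hidden split: each "GPU" sees only part of every token, so its statistics are wrong.
hidden_shards = np.split(X, n_gpus, axis=1)
out_hidden = np.concatenate([layer_norm(h) for h in hidden_shards], axis=1)

print(np.allclose(out_seq, reference))      # sequence split matches
print(np.allclose(out_hidden, reference))   # hidden split does not
```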

SP Region (LayerNorm, Dropout)

Activations split along sequence dim: [seq/N, hidden]

↓ all-gather (transitions to TP)

TP Region (Attention, MLP)

Activations split along hidden dim: [seq, hidden/N]

↓ reduce-scatter (transitions to SP)

SP Region (LayerNorm, Dropout)

Activations split along sequence dim: [seq/N, hidden]

SP does not add communication cost: Vanilla TP uses two all-reduce ops per transformer layer in the forward pass (one per block). TP+SP uses two all-gathers and two reduce-scatters. Since an all-reduce is equivalent to a reduce-scatter followed by an all-gather, the total communication volume is the same. But now the maximum activation size per GPU drops from [seq, hidden] to max([seq/N, hidden], [seq, hidden/N]), which is seq × hidden / N elements either way.
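The equivalence between one all-reduce and a reduce-scatter followed by an all-gather can be simulated directly. The three helper functions below are toy single-process stand-ins written for this illustration, not calls into any communication library, and the buffer shapes are arbitrary.

```python
import numpy as np

def all_reduce(buffers):
    # Every rank ends up with the full sum.
    total = np.sum(buffers, axis=0)
    return [total.copy() for _ in buffers]

def reduce_scatter(buffers, axis=0):
    # Rank i ends up with chunk i of the sum.
    total = np.sum(buffers, axis=0)
    return np.split(total, len(buffers), axis=axis)

def all_gather(chunks, axis=0):
    # Every rank ends up with all chunks concatenated.
    full = np.concatenate(chunks, axis=axis)
    return [full.copy() for _ in chunks]

rng = np.random.default_rng(0)
n_gpus = 4
# One partial-sum buffer per simulated GPU, e.g. row-linear outputs of shape (seq, hidden).
partials = [rng.standard_normal((8, 6)) for _ in range(n_gpus)]

via_all_reduce = all_reduce(partials)
scattered = reduce_scatter(partials, axis=0)    # (seq / N, hidden) per rank: the SP layout
via_two_steps = all_gather(scattered, axis=0)   # back to (seq, hidden) on every rank

print(scattered[0].shape)
print(all(np.allclose(a, b) for a, b in zip(via_all_reduce, via_two_steps)))
```

TP+SP keeps the activations in the scattered layout through LayerNorm and dropout, and only gathers them again on entry to the next TP region.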

Check: What dimension does sequence parallelism split along?

The hidden dimension
The sequence dimension
The batch dimension

Chapter 6: TP+SP in Practice

Let us track exactly what happens to the activation shape as data flows through a transformer layer with TP+SP:

Location	TP Only	TP + SP
Enter column-linear	h: sharded, s: full	h: sharded, s: all-gather to full
TP region	h: sharded, s: full	h: sharded, s: full
Exit row-linear	h: full (all-reduce)	h: full (reduce-scatter to s: sharded)
SP region (LN, dropout)	h: full, s: full	h: full, s: sharded

The maximum activation tensor on any GPU is now [seq/N, hidden] rather than [seq, hidden]. With a TP+SP degree of 16, this allows fitting sequence lengths of 16K tokens that would be impossible with TP alone.
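As a rough worked example of the saving, the arithmetic below sizes a single [seq, hidden] activation tensor. The hidden size of 8192 and the 2-byte (bf16) element size are assumptions chosen for illustration; only the 16K sequence length and the degree of 16 come from the text above.

```python
def activation_bytes(seq, hidden, bytes_per_elem=2):
    """Size of one (seq, hidden) activation tensor; 2 bytes per element assumes bf16."""
    return seq * hidden * bytes_per_elem

seq, hidden, n = 16_384, 8_192, 16      # hidden size is a hypothetical value

full = activation_bytes(seq, hidden)            # replicated tensor in an unsharded LayerNorm region
sp_shard = activation_bytes(seq // n, hidden)   # SP region: [seq/N, hidden]
tp_shard = activation_bytes(seq, hidden // n)   # TP region: [seq, hidden/N]

mib = 1024 ** 2
print(f"full: {full // mib} MiB, SP shard: {sp_shard // mib} MiB, TP shard: {tp_shard // mib} MiB")
```

This counts one tensor only; a real training step keeps many such activations per layer, so the factor-of-N reduction compounds.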

Embedding layer: The embedding layer is also row-parallel (sharded on vocabulary). With SP, the output gets reduce-scattered along the sequence dimension, consistent with the rest of the SP regions.
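One common way to realize a vocabulary-sharded embedding is for each GPU to look up only the token ids that fall in its own vocabulary range and output zeros for the rest, so that summing across GPUs yields the full lookup. The NumPy sketch below assumes that scheme; the sizes and token ids are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus, vocab, hidden = 4, 16, 8                  # hypothetical sizes
E = rng.standard_normal((vocab, hidden))          # full embedding table
tokens = np.array([0, 5, 9, 15, 3, 12, 7, 1])     # one sequence of 8 token ids

rows_per_gpu = vocab // n_gpus
partials = []
for rank in range(n_gpus):
    start = rank * rows_per_gpu
    E_local = E[start:start + rows_per_gpu]       # this rank's vocabulary shard
    in_range = (tokens >= start) & (tokens < start + rows_per_gpu)
    local_ids = np.where(in_range, tokens - start, 0)   # dummy id 0 for foreign tokens
    out = E_local[local_ids]                      # fancy indexing returns a copy
    out[~in_range] = 0.0                          # foreign tokens contribute zeros
    partials.append(out)

# Vanilla TP: all-reduce, every rank gets the full (seq, hidden) result.
embedded = np.sum(partials, axis=0)

# TP+SP: reduce-scatter along the sequence dimension, rank i keeps chunk i.
seq_shards = np.split(embedded, n_gpus, axis=0)

print(np.allclose(embedded, E[tokens]))
print(seq_shards[0].shape)
```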

Benchmarks confirm: TP+SP enables significantly larger batch sizes per GPU through activation memory savings, with the same communication cost as vanilla TP. The performance drop beyond TP=8 (crossing node boundaries) remains the same limiting factor.

Check: Does TP+SP increase total communication volume compared to vanilla TP?

Yes, it doubles the communication
No, it replaces all-reduce with equivalent all-gather + reduce-scatter
It depends on the sequence length

Chapter 7: TP Scaling Simulator

Explore the trade-off between TP degree, throughput, and memory. Higher TP reduces per-GPU memory but adds communication overhead.

Interactive simulation: Tensor Parallelism Explorer. Adjust the TP degree and the model size (in billions of parameters) to observe the throughput/memory trade-off.

Check: At which TP degree does the biggest performance cliff typically appear?

TP=4 (half a node)
TP=8 to TP=16 (crossing the node boundary)
TP=2 (starting to parallelize)

Chapter 8: Summary

Technique	What it shards	Communication	Best for
Column TP	Weights along output dim	Broadcast + all-gather	First linear in MLP, Q/K/V projections
Row TP	Weights along input dim	Scatter + all-reduce	Second linear in MLP, output projection
TP+SP	Weights + activations (hidden & seq)	All-gather + reduce-scatter	Maximum activation memory savings

What comes next: TP+SP splits activations along the hidden dimension inside the TP region and along the sequence dimension for LayerNorm and dropout. But inside the TP region, each GPU still processes the full sequence. For very long sequences (128K+ tokens), we need context parallelism, which splits the sequence for the attention computation itself.

Check: TP is best kept within a single node. What technique is better for cross-node model distribution?

More data parallelism
Gradient accumulation
Pipeline parallelism (lower bandwidth requirements)