Ch 3: Sharded Matmul — Scaling Book

Chapter 0: Why Shard?

When you train an LLM on ten thousand chips, you are still doing, in the abstract, the same computation as on one chip. The difference is that your arrays do not fit in the HBM of a single chip.

We "shard" (split) arrays across devices. Sometimes for memory — the model simply does not fit. Sometimes for speed — even if it fits on fewer chips, using more gives us more FLOPs/s. During inference, we often choose larger topologies to reduce latency rather than because we need the memory.

The art of scaling: Figuring out how to shard your model so computation remains efficient. Different shardings lead to different communication patterns and different costs. This chapter gives you the tools to analyze them all.

Consider a 2D array A[I, J] sharded across 4 TPUs in a 2×2 mesh:

	Y=0	Y=1
X=0	A[0:I/2, 0:J/2]	A[0:I/2, J/2:J]
X=1	A[I/2:I, 0:J/2]	A[I/2:I, J/2:J]

The global (logical) shape is still (I, J). But the local shape on each device is (I/2, J/2) — each chip holds 1/4 of the total array.
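As a minimal sketch of the layout above, the plain-Python helper below lists which slice of the global array each device holds. The name `block_slices` is hypothetical (it is not from JAX or any library), and the sketch assumes that array dimension d is split evenly across mesh axis d.

```python
from itertools import product

def block_slices(global_shape, mesh_shape):
    """Map each mesh coordinate to the (start, stop) range it holds per dimension.

    Assumes dimension d of the array is split evenly across mesh axis d.
    """
    out = {}
    for coord in product(*(range(n) for n in mesh_shape)):
        ranges = []
        for dim, n, c in zip(global_shape, mesh_shape, coord):
            if dim % n != 0:
                raise ValueError("dimension must divide evenly across the mesh axis")
            step = dim // n
            ranges.append((c * step, (c + 1) * step))
        out[coord] = tuple(ranges)
    return out

# A[I, J] with I = 8, J = 4 on a 2x2 mesh: device (X=1, Y=0) holds A[4:8, 0:2].
slices = block_slices((8, 4), (2, 2))
```

Each of the four entries covers half of each dimension, matching the (I/2, J/2) local shape.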

A bf16[4096, 8192] array is sharded across 16 chips with both axes split 4 ways. How many bytes per chip?

2 × (4096/4) × (8192/4) = 4 MiB

2 × 4096 × 8192 / 16 = 4 MiB

Both of the above (they are the same calculation)

Chapter 1: Sharding Notation

We use a clean notation: subscripts on array dimensions tell you which mesh axis they are sharded across.

The Device Mesh

A device mesh is a named grid of devices. Example: Mesh({'X': 4, 'Y': 2}) is an 8-device grid with axis names X and Y.

The Sharding

A sharding assigns mesh axes to array dimensions using subscripts:

Notation	Meaning	Local Shape (for I=1024, J=4096)
A[I, J]	Fully replicated on every device	(1024, 4096) × 8 copies
A[I_X, J]	I sharded across X, J replicated across Y	(256, 4096)
A[I_X, J_Y]	I sharded across X, J across Y	(256, 2048)
A[I_XY, J]	I sharded across both X and Y (flattened)	(128, 4096)

Rule: A mesh axis can appear at most once per tensor. A[I_X, J_X] is forbidden — you cannot shard two different tensor dimensions along the same mesh axis.

Replication

When a dimension is NOT subscripted with a mesh axis, the data is replicated along that axis. For example, A[I_X, J] on Mesh({'X': 4, 'Y': 2}) means I is split 4 ways across X, but J is fully present on every device — with 2 complete copies (one per Y-plane).
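The notation, the at-most-once rule, and replication can be checked mechanically. The sketch below is plain Python, not a JAX API: `local_shape` and `bytes_per_device` are hypothetical helpers, and a sharding is written as one tuple of mesh-axis names per array dimension, with `()` meaning replicated.

```python
from math import prod

def local_shape(global_shape, spec, mesh):
    """Local shape of an array under a sharding spec.

    spec has one entry per array dimension: a tuple of mesh-axis names the
    dimension is sharded over, or () if the dimension is replicated.
    """
    used = [axis for axes in spec for axis in axes]
    if len(used) != len(set(used)):
        raise ValueError("a mesh axis can appear at most once per tensor")
    shape = []
    for dim, axes in zip(global_shape, spec):
        ways = prod(mesh[a] for a in axes)  # prod over () is 1, i.e. replicated
        if dim % ways != 0:
            raise ValueError("dimension must divide evenly")
        shape.append(dim // ways)
    return tuple(shape)

def bytes_per_device(global_shape, spec, mesh, bytes_per_element):
    """Memory one device needs for its local shard."""
    return bytes_per_element * prod(local_shape(global_shape, spec, mesh))

mesh = {"X": 4, "Y": 2}
shape_xy = local_shape((1024, 4096), (("X", "Y"), ()), mesh)  # A[I_XY, J]
```

Mesh axes that never appear in the spec do not shrink the local shape, which is exactly what replication along those axes means.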

In JAX code

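import jax
import jax.numpy as jnp
from jax.sharding import NamedSharding, PartitionSpec as P
# Build a 4 x 2 device mesh with axes X and Y (requires 8 devices).
mesh = jax.make_mesh((4, 2), ('X', 'Y'))
# A[I_X, J_Y] sharding:
A = jnp.zeros((1024, 4096), device=NamedSharding(mesh, P('X', 'Y')))
# A[I, J_Y] (I replicated, J sharded):
B = jnp.zeros((2048, 4096), device=NamedSharding(mesh, P(None, 'Y')))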

Array int8[128, 2048] with sharding A[I_XY, J] on Mesh({'X': 2, 'Y': 8, 'Z': 2}). How much memory per device?

1 × (128/16) × 2048 = 16,384 bytes (sharded over XY, replicated over Z)

1 × 128 × 2048 / 32 = 8,192 bytes (sharded over all)

1 × (128/2) × (2048/8) = 16,384 bytes

Chapter 2: Case 1 — No Sharded Contraction

The simplest case: neither input has a sharded contracting dimension. No communication is needed at all.

The contracting dimension is the one being summed over. In A[I, J] · B[J, K] → C[I, K], J is the contracting dimension.

Rule: When the contracting dimension J is unsharded, each device can independently multiply its local chunks. The output inherits shardings from both inputs.

All of these work with zero communication:

A[I, J] · B[J, K] → C[I, K]

A[I_X, J] · B[J, K] → C[I_X, K]

A[I, J] · B[J, K_Y] → C[I, K_Y]

A[I_X, J] · B[J, K_Y] → C[I_X, K_Y]

Think about why: each device has a complete slice of the contracting dimension. The multiplications along J are fully local. The non-contracting dimensions just come along for the ride.

The last example is particularly powerful: A[I_X, J] · B[J, K_Y] → C[I_X, K_Y] gives you a result sharded across both mesh axes with zero comms. This is the basis of many efficient parallelism strategies.
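To make the zero-communication claim concrete, here is a small simulation in plain Python, with lists standing in for device memory. The helper names are hypothetical, and a 2×2 mesh with tiny integer matrices is assumed so the result can be compared exactly against the unsharded product.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_rows(m, n):
    step = len(m) // n
    return [m[i * step:(i + 1) * step] for i in range(n)]

def split_cols(m, n):
    step = len(m[0]) // n
    return [[row[j * step:(j + 1) * step] for row in m] for j in range(n)]

# Global arrays: A is 4x3, B is 3x4. Mesh axes X and Y both have size 2.
A = [[i * 3 + j for j in range(3)] for i in range(4)]
B = [[j * 4 + k for k in range(4)] for j in range(3)]

A_shards = split_rows(A, 2)   # A[I_X, J]: device (x, y) holds A_shards[x]
B_shards = split_cols(B, 2)   # B[J, K_Y]: device (x, y) holds B_shards[y]

# Each device multiplies only what it already holds: no communication.
C_local = {(x, y): matmul(A_shards[x], B_shards[y]) for x in range(2) for y in range(2)}

# Reassemble C[I_X, K_Y] from the four local blocks to compare with the global product.
C_rebuilt = []
for x in range(2):
    for r in range(2):
        C_rebuilt.append(C_local[(x, 0)][r] + C_local[(x, 1)][r])
```

The reassembly step exists only for checking; in a real program the blocks would simply stay where they are as the sharded output.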

A[I_X, J] · B[J, K_X] → C[I_X, K_X]. Does this work with no communication?

Yes, J is unsharded so it's Case 1

No! Both non-contracting dims are sharded along X. This is invalid (Case 4)

Only if X has wraparound links

Chapter 3: Case 2 — AllGather

When one input has its contracting dimension sharded, we cannot do a local multiply directly. We need to first gather the complete contracting dimension onto every device.

A[I, J_X] · B[J, K] → C[I, K]

A is sharded along J (the contracting dim), but B expects the full J. Solution:

AllGather_X(A[I, J_X]) → A[I, J]

A[I, J] · B[J, K] → C[I, K]

What is an AllGather?

An AllGather removes a sharding subscript: each device sends its shard around a ring until every device has a full copy.
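The ring exchange can be sketched as a short simulation. This is an illustrative model, not how any runtime implements it: `ring_all_gather` is a hypothetical helper, and it sends in one direction only for simplicity, so it takes N - 1 steps. A bidirectional ring sends both ways at once and needs roughly half as many.

```python
def ring_all_gather(shards):
    """Simulate a one-directional ring AllGather.

    shards[d] is the shard initially held by device d. Returns the list of
    per-device results and the number of communication steps taken.
    """
    n = len(shards)
    # have[d] maps shard index -> shard for everything device d holds so far.
    have = [{d: shards[d]} for d in range(n)]
    # in_flight[d] is the (index, shard) pair device d will forward next.
    in_flight = [(d, shards[d]) for d in range(n)]
    steps = 0
    for _ in range(n - 1):
        # Every device sends its in-flight shard to its right-hand neighbour.
        received = [in_flight[(d - 1) % n] for d in range(n)]
        for d, (idx, shard) in enumerate(received):
            have[d][idx] = shard
        in_flight = received
        steps += 1
    gathered = [[have[d][i] for i in range(n)] for d in range(n)]
    return gathered, steps

gathered, steps = ring_all_gather(["a", "b", "c", "d"])
```

Every device only ever talks to its neighbour, which is why the cost is governed by per-link bandwidth rather than by the number of devices.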

Animation: 8 devices exchange shards around a ring. Each starts with 1/8 of the array and ends with a full copy after 4 steps.

How long does it take?

For V total bytes sharded across X devices, using a bidirectional ring with wraparound links:

T_AllGather = V / W_ici, where W_ici is the bidirectional link bandwidth

Remarkable fact: The AllGather time does not depend on X (the number of devices). More devices means smaller shards but more hops, and these cancel perfectly. You are purely bottlenecked by the link bandwidth.

When do we enter a latency-bound regime? When each shard is so small that per-hop overhead (~1 μs) dominates. For TPU v5e, with 4.5e10 bytes/s of unidirectional bandwidth per link, a hop carrying less than about 4.5e10 × 1e-6 = 45 kB is latency-bound.
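The cancellation and the 45 kB threshold can both be reproduced with a few lines of arithmetic. The sketch below assumes a simplified model: an even number of devices, about N/2 steps on a bidirectional ring, and one shard moved per direction per step. The function names are hypothetical, and the constants are the figures quoted in the text.

```python
def ring_all_gather_time(total_bytes, n_devices, w_unidirectional):
    """Bandwidth-bound time for a bidirectional ring AllGather with wraparound.

    Each step moves one shard of total_bytes / n_devices in each direction,
    and about n_devices / 2 steps are needed for every shard to arrive.
    """
    shard_bytes = total_bytes / n_devices
    steps = n_devices / 2
    return steps * shard_bytes / w_unidirectional

def latency_bound_threshold(w_unidirectional, hop_latency_s):
    """Per-hop payload below which the hop latency exceeds the transfer time."""
    return w_unidirectional * hop_latency_s

W_UNI = 4.5e10      # bytes/s per direction, TPU v5e figure quoted in the text
HOP_LATENCY = 1e-6  # ~1 microsecond per hop, as quoted in the text

# Same 34 MB array, very different device counts: the times agree.
times = [ring_all_gather_time(34e6, n, W_UNI) for n in (4, 8, 16, 256)]
threshold = latency_bound_threshold(W_UNI, HOP_LATENCY)
```

The device count cancels because the step count grows as N while the shard size shrinks as 1/N, leaving V / (2 × unidirectional bandwidth), which is V / W_ici.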

Multi-axis AllGather

Gathering over multiple axes increases available bandwidth by a factor of N_axes:

T = V / (W_ici × N_axes)

AllGather_Y([E_Y, F]) on TPU v5e 8×4 mesh, E=2048, F=8192, bf16. Total array = 34 MB. The 4-chip Y axis has no wraparound. Time is closest to:

~377 μs (34e6 / 9e10, assuming full bidi ring)

~560 μs (3 hops of 8.4 MB each at 4.5e10 unidirectional, no wraparound)

~3 μs (latency-bound)

Chapter 4: Case 3 — AllReduce

When both inputs are sharded along the contracting dimension on the same mesh axis:

A[I, J_X] · B[J_X, K] → C[I, K]

Here, each device can do a local matmul of its partial shard. But the result on each device is only a partial sum — we need to add them all up.

We write this using the "unreduced" notation {U_X}:

A[I, J_X] ·_LOCAL B[J_X, K] → C[I, K] {U_X}

The partial sums are then resolved with an AllReduce:

AllReduce_X(C[I, K] {U_X}) → C[I, K]

AllReduce = ReduceScatter + AllGather

An AllReduce can be decomposed into two cheaper operations:

ReduceScatter_X,K(C[I, K] {U_X}) → C[I, K_X]

AllGather_X(C[I, K_X]) → C[I, K]

This decomposition is crucial because often we want the result sharded anyway. In that case, we can skip the final AllGather and just use the ReduceScatter, saving half the communication.
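The following plain-Python simulation walks through Case 3 on a two-device axis, as a sketch under simplifying assumptions: the reduction is computed centrally rather than around a ring, and the tiny integer matrices are made up for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(m1, m2):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

N = 2  # size of mesh axis X
A = [[1, 2, 3, 4], [5, 6, 7, 8]]       # A[I, J] with I = 2, J = 4
B = [[1, 0], [0, 1], [2, 2], [3, 1]]   # B[J, K] with J = 4, K = 2

step = len(B) // N
A_shards = [[row[d * step:(d + 1) * step] for row in A] for d in range(N)]  # A[I, J_X]
B_shards = [B[d * step:(d + 1) * step] for d in range(N)]                   # B[J_X, K]

# Local matmul: each device holds a full-shape but partial-sum C[I, K] {U_X}.
partial = [matmul(A_shards[d], B_shards[d]) for d in range(N)]

# ReduceScatter over K (computed centrally here for clarity):
# device d ends up with column block d of the summed result, i.e. C[I, K_X].
total = partial[0]
for p in partial[1:]:
    total = add(total, p)
k_step = len(total[0]) // N
scattered = [[row[d * k_step:(d + 1) * k_step] for row in total] for d in range(N)]

# AllGather: concatenate the column blocks so every device holds C[I, K].
gathered = [sum((scattered[d][r] for d in range(N)), []) for r in range(len(total))]
```

Stopping after `scattered` corresponds to using the ReduceScatter alone and leaving the output as C[I, K_X].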

Communication costs

T_AllReduce = 2 × V / W_ici

That is 2x the cost of an AllGather, because we do a ReduceScatter (V / W_ici) then an AllGather (V / W_ici).

ReduceScatter defers the AllGather: If the next operation wants a sharded input anyway, we can do a ReduceScatter instead of an AllReduce and save half the comms. This is extremely common in practice — each layer can accept sharded input and produce sharded output.

A[I, J_X] · B[J_X, K] with ReduceScatter instead of AllReduce. What is the output sharding?

C[I, K_X]: the ReduceScatter sums the partial products and shards K along X

C[I, K]: fully replicated

C[I_X, K]: ReduceScatter always shards the first dimension

Chapter 5: Case 4 — Invalid Shardings

The fourth case is when both non-contracting dimensions are sharded along the same mesh axis:

A[I_X, J] · B[J, K_X] → C[I_X, K_X] — INVALID

This is invalid because device i along X would hold the (i, i)-th block of C — a diagonal entry. There is not enough information to reconstruct the full matrix.

Resolution: AllGather one of the inputs to remove the conflicting subscript. You have two choices, and the right one depends on context:

AllGather_X(A[I_X, J]) → A[I, J], then A[I, J] · B[J, K_X] → C[I, K_X]

or:

AllGather_X(B[J, K_X]) → B[J, K], then A[I_X, J] · B[J, K] → C[I_X, K]

In both cases, the result only mentions X once. Which you pick depends on which sharding the downstream operations need, and on the relative sizes of the arrays (gather the smaller one if comms is the bottleneck).

The four cases at a glance

Case	Condition	Communication
1	Neither input sharded on contracting dim	None
2	One input sharded on contracting dim	AllGather the sharded input
3	Both sharded on contracting dim (same axis)	Local matmul + AllReduce (or ReduceScatter)
4	Both sharded on same axis (non-contracting)	AllGather one input first
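The table can be turned into a small decision procedure. The function below is a hypothetical sketch, not part of any library: it handles only the situations covered in the text, with at most one mesh axis per dimension, and raises for anything else.

```python
def classify_matmul(a_spec, b_spec):
    """Classify A[I, J] . B[J, K] into the four cases from the table.

    a_spec = (axis of I, axis of J); b_spec = (axis of J, axis of K).
    Each entry is a mesh-axis name, or None for an unsharded dimension.
    """
    a_i, a_j = a_spec
    b_j, b_k = b_spec
    if a_i is not None and a_i == b_k:
        return 4  # both non-contracting dims on the same mesh axis
    if a_j is None and b_j is None:
        return 1  # contracting dim unsharded on both inputs
    if a_j is not None and b_j is not None:
        if a_j == b_j:
            return 3  # both sharded on the contracting dim, same axis
        raise ValueError("contracting dim sharded on different axes: not covered here")
    return 2  # exactly one input sharded on the contracting dim

# The worked examples from the earlier chapters:
case_1 = classify_matmul(("X", None), (None, "Y"))   # A[I_X, J] . B[J, K_Y]
case_2 = classify_matmul((None, "X"), (None, None))  # A[I, J_X] . B[J, K]
case_3 = classify_matmul((None, "X"), ("X", None))   # A[I, J_X] . B[J_X, K]
case_4 = classify_matmul(("X", None), (None, "X"))   # A[I_X, J] . B[J, K_X]
```

Checking Case 4 first is a design choice: the conflict on the non-contracting dimensions has to be resolved regardless of how the contracting dimension is sharded.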

A[I_X, J_Y] · B[J_Y, K] → ? Which case is this?

Case 1 (no sharded contraction)

Case 2 (one input has sharded contracting dim J_Y, need AllGather)

Case 3 (both sharded on contracting dim)

Chapter 6: Communication Costs

Here is the complete cost model for all four communication primitives, assuming we are in the bandwidth-bound regime (arrays large enough that per-hop latency is negligible).

Operation	What It Does	Syntax	Time
AllGather	Removes sharding subscript, replicates	[A_X, B] → [A, B]	V / (W_ici × N_axes)
ReduceScatter	Sums partial products, introduces sharding	[A, B] {U_X} → [A_X, B]	Same as AllGather
AllReduce	Sums partial products, keeps replicated	[A, B] {U_X} → [A, B]	2 × AllGather
AllToAll	Moves subscript between dims	[A, B_X] → [A_X, B]	AllGather / 4

The stunning result: None of these costs depend on the number of devices (in the bandwidth-bound regime). Whether you have 4 or 256 chips, the cost only depends on the total array size and the per-link bandwidth. More chips means smaller shards but proportionally more hops.

Why AllToAll is 4x cheaper

In an AllGather, every shard must reach every device: each shard hops across the full ring. In an AllToAll, shard i only needs to reach device i. On average, that is N/4 hops (half the ring, with bidirectional sending). This gives a factor of 4 savings.

The latency-bound regime

When arrays are very small (< ~45 kB per hop on TPU v5e), per-hop latency (~1 μs) dominates:

T_total = max(T_min × |X| / 2, V / W_ici)

In this regime, more devices does increase communication time. This matters during autoregressive generation where buffers are tiny.
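The cost table and the latency bound fit in two short functions. This is a sketch for a single mesh axis only; the function names are hypothetical, and the default `t_min` of 1 μs is the per-hop figure quoted in the text.

```python
def collective_time(op, total_bytes, w_ici):
    """Bandwidth-bound time on one mesh axis; w_ici is bidirectional bytes/s."""
    all_gather = total_bytes / w_ici
    factor = {
        "all_gather": 1.0,
        "reduce_scatter": 1.0,  # same cost as AllGather
        "all_reduce": 2.0,      # ReduceScatter followed by AllGather
        "all_to_all": 0.25,     # AllGather / 4
    }[op]
    return factor * all_gather

def all_gather_time_with_latency(total_bytes, w_ici, axis_size, t_min=1e-6):
    """T_total = max(T_min * |X| / 2, V / W_ici) for a single-axis AllGather."""
    return max(t_min * axis_size / 2, total_bytes / w_ici)

big = all_gather_time_with_latency(34e6, 9e10, 8)   # bandwidth term wins
tiny = all_gather_time_with_latency(256, 9e10, 4)   # latency term wins
```

For the 256-byte buffer the result is set entirely by the hop count, which is why small collectives get slower as the axis grows.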

An AllReduce is 2x the cost of an AllGather because:

It is a ReduceScatter (= 1 AllGather cost) followed by an AllGather (another 1 AllGather cost)

It sends data in both directions instead of one

It has to do arithmetic (summation) at each hop

Chapter 7: AllToAll & ReduceScatter Details

AllToAll

The AllToAll moves a subscript from one dimension to another:

AllToAll_X,J(A[I_X, J]) → A[I, J_X]

Think of it as a distributed transpose. It arises naturally in Mixture-of-Experts models (routing tokens to experts on different devices) and when resharding between computation phases that need different layouts.

For a 1D mesh, the cost is V / (4 × W_ici). For an ND mesh with axes of sizes A, B, C, ..., where N = A × B × C × ... is the total number of devices:

T_AllToAll = V × max(A, B, C, ...) / (4 × N × W_ici)
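A direct transcription of this formula, as a sketch: `all_to_all_time` is a hypothetical helper, and it takes N to be the total number of devices, which is the reading under which the 1D case reduces to V / (4 × W_ici).

```python
def all_to_all_time(total_bytes, mesh_axis_sizes, w_ici):
    """T = V * max(axis sizes) / (4 * N * W_ici), with N the total device count."""
    n_devices = 1
    for size in mesh_axis_sizes:
        n_devices *= size
    return total_bytes * max(mesh_axis_sizes) / (4 * n_devices * w_ici)

t_1d = all_to_all_time(1e6, (16,), 9e10)    # 16 devices on one ring
t_2d = all_to_all_time(1e6, (4, 4), 9e10)   # the same 16 devices as a 4x4 mesh
```

Under this formula the 4×4 layout is cheaper than a single ring of 16, because the longest axis is shorter while the device count is unchanged.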

ReduceScatter: the derivative of AllGather

This is a deeper fact than it first appears. If the forward pass does:

AllGather_X(A[I_X]) → A[I]

Then the backward pass does:

ReduceScatter_X(dA[I] {U_X}) → dA[I_X]

And vice versa. This is because broadcast and reduce are transposes of each other as linear operators, and AllGather/ReduceScatter are their Kronecker products with the identity.

Practical consequence: In training, every AllGather in the forward pass becomes a ReduceScatter in the backward pass (and vice versa). The total training communication is determined by the forward-pass communication pattern plus its "mirror image."
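The transpose relationship can be checked numerically: a linear map L and its transpose satisfy ⟨L(x), y⟩ = ⟨x, Lᵀ(y)⟩ for all x and y. The sketch below models each collective as a function on a list of per-device values, with one scalar shard per device. The helpers are hypothetical and integers are used so the comparison is exact.

```python
def all_gather(x):
    """x[d] is device d's shard; every device ends with the full list."""
    return [list(x) for _ in x]

def reduce_scatter(y):
    """y[d] is device d's unreduced full-length vector; device i keeps sum_d y[d][i]."""
    n = len(y)
    return [sum(y[d][i] for d in range(n)) for i in range(n)]

def inner(u, v):
    """Inner product of two equally shaped nested lists (1-D or 2-D)."""
    if isinstance(u[0], list):
        return sum(inner(a, b) for a, b in zip(u, v))
    return sum(a * b for a, b in zip(u, v))

x = [1, -2, 3, 5]                                          # one scalar shard per device
y = [[d * 4 + i + 1 for i in range(4)] for d in range(4)]  # arbitrary upstream gradient

lhs = inner(all_gather(x), y)       # <AllGather(x), y>
rhs = inner(x, reduce_scatter(y))   # <x, ReduceScatter(y)>
```

Both sides equal the sum over devices d and indices i of x[i] × y[d][i], so the identity holds for any choice of x and y, not just these values.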

Collective matmul: overlapping comms with compute

In practice, we overlap the AllGather/ReduceScatter with the matmul itself using a technique called collective matmul. The idea: start the matmul on available chunks while the remaining chunks are still being gathered. Each chunk's matmul overlaps with the next chunk's network transfer.

This is how we approach the max(T_math, T_comms) lower bound in practice.
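A toy timeline model shows why overlapping gets close to the max bound. It is an illustration under strong assumptions rather than a description of any real implementation: chunks arrive one after another at a fixed interval, each chunk costs the same compute time, and the first chunk is already local. Both function names are hypothetical.

```python
def serial_time(n_chunks, t_comm_chunk, t_math_chunk):
    """Gather every remote chunk first, then do all the math."""
    return (n_chunks - 1) * t_comm_chunk + n_chunks * t_math_chunk

def overlapped_time(n_chunks, t_comm_chunk, t_math_chunk):
    """Compute on chunk i while chunk i + 1 is still in flight.

    Chunk 0 is local, so it is ready at time 0; chunk i arrives at
    i * t_comm_chunk. Compute on a chunk starts once it has arrived and the
    previous chunk's compute has finished.
    """
    finish = 0.0
    for i in range(n_chunks):
        arrival = i * t_comm_chunk
        finish = max(finish, arrival) + t_math_chunk
    return finish

math_bound = overlapped_time(8, 1.0, 2.0)   # compute is the bottleneck
comms_bound = overlapped_time(8, 2.0, 1.0)  # network is the bottleneck
```

In the math-bound case the transfers are hidden completely; in the comms-bound case only the last chunk's compute is left exposed after the final transfer.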

In training, if the forward pass uses AllGather_X(W[D_X, F]) to gather weights, the backward pass will use:

ReduceScatter_X(dW[D, F] {U_X}) → dW[D_X, F]

Another AllGather to send gradients

AllReduce of the gradients

Chapter 8: Exercises

Exercise 1: Replicated sharding overhead

Array A[I_X, J, K, ...] on Mesh({'X': 4, 'Y': 8, 'Z': 2}). Only sharded across X. What is the ratio of total bytes across all chips to one copy of A?

Answer: Each chip holds sizeof(A) / 4, and there are 4 × 8 × 2 = 64 chips, so the total is 64 × sizeof(A) / 4 = 16 × sizeof(A). Equivalently, A is replicated across Y and Z, giving Y × Z = 16 copies. Ratio = 16.

Exercise 2: AllGather latency

AllGather_X([B_X, D_Y]) on TPU v4p 4×4×4, B=1024, D=4096, bf16. Mesh{'X':4, 'Y':4, 'Z':4}.

We gather over X only, so each Y-shard is gathered independently. Each device starts with bf16[256, 1024] = 0.5 MB and ends with bf16[1024, 1024] = 2 MB.

T = 2BD / (Y × W_ici) = 2 × 1024 × 4096 / (4 × 9e10) ≈ 23 μs

Exercise 3: Latency-bound AllGather

AllGather_X([B_X]) with B=128 in bf16 on TPU v4p 4×4×4. Total = 256 bytes, 64 bytes per device. At this size the bandwidth time per hop is negligible, so the collective is latency-bound. With wraparound on X=4, just 2 hops are needed: ~2 μs.

Exercise 4: AllGather vs. AllReduce strategy

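X[B, D] · Y[D_X, F] → Z[B, F]. Two strategies (in the costs below, C is per-chip FLOPs/s, W is the ICI bandwidth, and X in a formula means the size of mesh axis X):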

Strategy 1: AllGather Y first, then matmul. Cost: max(2BDF/C, 2DF/W).

Strategy 2: Treat as Case 3 (local matmul + AllReduce). Cost: max(2BDF/(X×C), 4BF/W).

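Strategy 2 does only 1/X as many FLOPs per device, but its AllReduce costs 4BF/W, versus 2DF/W for the AllGather in Strategy 1. When D > 2B, Strategy 2 therefore moves fewer bytes and can be better in comms-bound cases. But it requires the contracting dim to be sharded on both inputs (here X must first be sliced locally to X[B, D_X]), and having one input replicated while the other is sharded along the contracting dim is uncommon in practice (e.g., FSDP shards params and activations along the same axis).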
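The two cost expressions can be compared directly. In the sketch below, `strategy_costs` is a hypothetical helper, and the hardware numbers (1.97e14 FLOPs/s, 9e10 bytes/s) and array sizes are illustrative assumptions chosen to show both outcomes, not measurements.

```python
def strategy_costs(b, d, f, n_x, flops_per_s, w_ici):
    """Return (strategy 1 time, strategy 2 time) for bf16 X[B, D] . Y[D_X, F]."""
    # Strategy 1: AllGather Y (2DF bytes), then the full matmul on every device.
    s1 = max(2 * b * d * f / flops_per_s, 2 * d * f / w_ici)
    # Strategy 2: local matmul on 1/n_x of D, then AllReduce Z (2 * 2BF bytes).
    s2 = max(2 * b * d * f / (n_x * flops_per_s), 4 * b * f / w_ici)
    return s1, s2

FLOPS = 1.97e14  # assumed per-chip bf16 FLOPs/s, for illustration only
W = 9e10         # assumed bidirectional ICI bandwidth in bytes/s

small_batch = strategy_costs(b=64, d=4096, f=8192, n_x=4, flops_per_s=FLOPS, w_ici=W)
large_batch = strategy_costs(b=4096, d=1024, f=8192, n_x=4, flops_per_s=FLOPS, w_ici=W)
```

With the small batch (D > 2B) Strategy 2 comes out ahead; with the large batch (D < 2B) the AllReduce dominates and Strategy 1 wins.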

On a TPU v4p 4×4×4 with Mesh{'X':4,'Y':4,'Z':4} and W_ici=9e10 bidi, how long does AllReduce_Z([B_X, D_Y] {U_Z}) take? B=1024, D=4096, bf16.

~11.6 μs: each shard is 2BD/(X×Y) bytes, AllReduce = 4BD/(X×Y×W)

~93 μs: 2×full array / W

~23 μs: same as the AllGather

Chapter 9: Summary

Everything boils down to four cases and four primitives. Once you know the sharding of your inputs and the desired output, you can determine exactly what communication is needed and how long it takes.

Primitive	Effect	Cost
AllGather	Removes subscript: [A_X] → [A]	V / W
ReduceScatter	Sums + shards: [A]{U_X} → [A_X]	V / W
AllReduce	Sums: [A]{U_X} → [A]	2V / W
AllToAll	Moves subscript: [A_X, B] → [A, B_X]	V / (4W)

Key insights:

1. Communication costs do not depend on the number of devices (in the bandwidth-bound regime).

2. ReduceScatter and AllGather are transposes of each other. Every AllGather in forward = ReduceScatter in backward.

3. AllToAll is 4x cheaper than AllGather — use it when you just need to move a sharding subscript.

4. Collective matmul overlaps comms with compute, approaching the max(T_math, T_comms) bound.

You need to reshard A[I_X, J] to A[I, J_X]. The cheapest primitive is:

AllGather then re-shard manually

AllToAll: directly moves the subscript at 1/4 the cost of AllGather

AllReduce

Sharded Matrices and How to Multiply Them