Counting parameters, estimating FLOPs, choosing parallelism configs, and calculating training time — all by hand.
The LLaMA 3 family includes three main models: 8B, 70B, and 405B parameters. In this chapter we focus on the 70B variant, leaving the others as exercises.
Here is the architecture for LLaMA 3-70B, taken from the HuggingFace model config:
| Hyperparameter | Symbol | Value |
|---|---|---|
| Layers | L | 80 |
| Model dimension | D | 8,192 |
| FFN dimension | F | 28,672 |
| Attention heads | N | 64 |
| KV heads | K | 8 (GQA) |
| Head dimension | H | 128 |
| Vocabulary size | V | 128,256 |
Comparing the three LLaMA 3 models:
| Hyperparameter | 8B | 70B | 405B |
|---|---|---|---|
| Layers (L) | 32 | 80 | 126 |
| Model dim (D) | 4,096 | 8,192 | 16,384 |
| FFN dim (F) | 14,336 | 28,672 | 53,248 |
| Attention heads (N) | 32 | 64 | 128 |
| KV heads (K) | 8 | 8 | 8 |
| Head dim (H) | 128 | 128 | 128 |
| Vocab (V) | 128,256 | 128,256 | 128,256 |
Notice the patterns: H is always 128, V is always 128,256, and K is always 8 (GQA). The models scale primarily by increasing D, F, and L. F/D is 3.5 for the 8B and 70B models and 3.25 for 405B, somewhat wider than the 8/3 ≈ 2.67 rule of thumb for SwiGLU. The number of KV heads stays at 8 regardless of model size. This is a deliberate design choice: more KV heads do not help quality much, but KV cache size during inference grows in direct proportion to K.
The configuration is all you need to derive every important number: parameter count, FLOPs, memory, training time, and cost. Let us start.
Where to find these numbers: Every model on HuggingFace has a config.json that lists these hyperparameters. The mapping from HuggingFace names to our symbols:
| HuggingFace key | Symbol |
|---|---|
| num_hidden_layers | L |
| hidden_size | D |
| intermediate_size | F |
| num_attention_heads | N |
| num_key_value_heads | K |
| vocab_size | V |
The head dimension H is typically D/N = 8192/64 = 128 for LLaMA models, but it is also listed as head_dim in some configs.
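As a minimal sketch of that mapping, the snippet below translates a HuggingFace-style config dictionary into the chapter's symbols. The helper name `symbols_from_config` and the inline dictionary are illustrative assumptions; the values are copied from the LLaMA 3-70B table above rather than read from a real `config.json`.

```python
# Mapping from HuggingFace config.json keys to the symbols used in this chapter.
HF_KEY_TO_SYMBOL = {
    "num_hidden_layers": "L",
    "hidden_size": "D",
    "intermediate_size": "F",
    "num_attention_heads": "N",
    "num_key_value_heads": "K",
    "vocab_size": "V",
}


def symbols_from_config(config: dict) -> dict:
    """Translate a HuggingFace-style config dict into chapter symbols."""
    symbols = {symbol: config[key] for key, symbol in HF_KEY_TO_SYMBOL.items()}
    # Use head_dim when the config lists it; otherwise fall back to D / N.
    symbols["H"] = config.get("head_dim") or symbols["D"] // symbols["N"]
    return symbols


# Hypothetical in-memory config with the LLaMA 3-70B values from the table above.
llama3_70b_config = {
    "num_hidden_layers": 80,
    "hidden_size": 8192,
    "intermediate_size": 28672,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "vocab_size": 128256,
}

llama3_70b = symbols_from_config(llama3_70b_config)
print(llama3_70b)
```

For a real model you would load the dictionary with `json.load` from the downloaded `config.json` and pass it to the same helper.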
It is useful to make a spreadsheet with these numbers for many open-source LLMs. You will quickly see patterns: most models use H=128, most use SwiGLU with F ≈ 8/3 × D (rounded to a multiple of 256), and GQA is now standard.
Comparison with other model families:
| Model | Params | D | F | L | F/D | KV heads |
|---|---|---|---|---|---|---|
| LLaMA 3-70B | 70B | 8,192 | 28,672 | 80 | 3.5 | 8 |
| Mistral-7B | 7B | 4,096 | 14,336 | 32 | 3.5 | 8 |
| GPT-3 | 175B | 12,288 | 49,152 | 96 | 4.0 | 96 (MHA) |
| DeepSeek-V3 | 671B | 7,168 | 18,432 | 61 | 2.6 | MLA |
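Most recent models use a gated FFN (SwiGLU) with F/D between roughly 2.5 and 3.5, plus GQA or another scheme that keeps the KV cache small. GPT-3 predates both: it used a standard two-matrix FFN with F = 4D and full MHA with 96 KV heads, so its inference KV cache is 96/8 = 12x larger per layer than LLaMA 3-70B's (about 14x for the whole model, since GPT-3 has 96 layers to LLaMA's 80). DeepSeek-V3 uses Multi-head Latent Attention (MLA), which compresses the KV representation instead of reducing the number of KV heads.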
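The KV cache comparison can be checked with a short sketch. It assumes bf16 cache entries (2 bytes each) and a GPT-3 head dimension of 12,288 / 96 = 128; the function name is illustrative. Per layer the ratio is 96/8 = 12, and the different layer counts (96 vs 80) bring the whole-model ratio to 14.4.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# LLaMA 3-70B: 80 layers, 8 KV heads (GQA), head dimension 128.
llama_70b_kv = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
# GPT-3: 96 layers, 96 KV heads (full MHA), head dimension 12288 / 96 = 128.
gpt3_kv = kv_cache_bytes_per_token(layers=96, kv_heads=96, head_dim=128)

print(f"LLaMA 3-70B: {llama_70b_kv / 1e6:.2f} MB per token")
print(f"GPT-3:       {gpt3_kv / 1e6:.2f} MB per token")
print(f"Whole-model ratio: {gpt3_kv / llama_70b_kv:.1f}x")
```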
Let us derive the 70B parameter count from the config table. Every parameter in the Transformer falls into one of four groups, three of which matter:
| Component | Formula | Count |
|---|---|---|
| FFN (SwiGLU) | D × F × 3 × L (gate + up + down projections) | 8,192 × 28,672 × 3 × 80 = 56.4B |
| Attention | L × [2 × D × N × H + 2 × D × K × H] (Q, O projections + K, V projections) | 80 × (2 × 8,192 × 64 × 128 + 2 × 8,192 × 8 × 128) = 12.1B |
| Embeddings | 2 × V × D (input + output embeddings) | 2 × 128,256 × 8,192 = 2.1B |
| RMSNorm | 2 × D × L (pre-attn + pre-FFN) | 2 × 8,192 × 80 = 1.3M (negligible) |
| Total | | 70.6B |
Notice RMSNorm adds only 1.3M parameters — completely negligible. Rotary positional embeddings (RoPE) have zero stored parameters; they are computed on the fly. The parameter count is entirely dominated by the linear projections.
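The table's formulas translate directly into a few lines of Python. This is a sketch under the chapter's own formulas (the function name `count_params` is an assumption); the exact total comes out near 70.55B, so rounded table entries can differ in the last digit.

```python
def count_params(L, D, F, N, K, H, V):
    """Parameter count per component, using the formulas from the table above."""
    counts = {
        "ffn": 3 * D * F * L,                            # gate + up + down projections
        "attention": L * (2 * D * N * H + 2 * D * K * H),  # Q, O + K, V projections
        "embeddings": 2 * V * D,                         # input + output embeddings
        "rmsnorm": 2 * D * L,                            # pre-attn + pre-FFN scales
    }
    counts["total"] = sum(counts.values())
    return counts


llama3_70b_params = count_params(L=80, D=8192, F=28672, N=64, K=8, H=128, V=128256)
for name, value in llama3_70b_params.items():
    print(f"{name:>10}: {value / 1e9:8.3f}B")
```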
Why SwiGLU has 3 weight matrices instead of 2: A standard FFN has two matrices: up-projection (D → F) and down-projection (F → D). SwiGLU adds a gate matrix (D → F) whose activated output modulates the up-projection element-wise: output = (SiLU(x × W_gate) ⊙ (x × W_up)) × W_down. That makes three matrices of D × F parameters each, which is why the FFN count is D × F × 3 × L, not D × F × 2 × L.
Note on F for SwiGLU: Since SwiGLU has 3 matrices instead of 2, the FFN width F is typically reduced to keep the total parameter count similar to a standard FFN with F = 4D. The rule of thumb: F ≈ 8D/3 rounded to a multiple of 256. For LLaMA 70B: 8 × 8192 / 3 ≈ 21,845, but the actual value 28,672 (3.5 × D) is larger, likely chosen for efficiency reasons (it is a convenient multiple of hardware tile sizes).
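Let us verify the attention count step by step. Each layer has a Q projection of D × N × H = 8,192 × 8,192 = 67.1M parameters, K and V projections of D × K × H = 8,192 × 1,024 = 8.4M parameters each, and an O projection of N × H × D = 8,192 × 8,192 = 67.1M parameters. That is 151.0M parameters per layer, or 151.0M × 80 ≈ 12.08B across all 80 layers.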
A standard rule of thumb: a training step (forward + backward) uses approximately 6 × parameter count FLOPs per token. The factor of 6 comes from: 2 for the forward pass (multiply-accumulate) × 3 for forward + two backward passes.
That is about half a teraFLOP per token per step. On a single TPU v5p chip (459 TFLOPS bf16):
This assumes we are compute-bound and achieving near-peak FLOPs. In practice, we target 30-50% MFU (Model FLOPs Utilization).
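The per-token arithmetic is short enough to sketch directly. The snippet uses P = 70e9 as a round parameter count and the 459 TFLOPS peak quoted above; the function names are illustrative.

```python
PARAMS = 70e9                  # round parameter count for LLaMA 3-70B
V5P_PEAK_FLOPS = 4.59e14       # TPU v5p bf16 peak, FLOPs per second


def train_flops_per_token(params):
    """6 x params rule: forward + backward FLOPs for one token."""
    return 6 * params


def tokens_per_second(params, peak_flops, mfu=1.0):
    """Training throughput of one chip at a given Model FLOPs Utilization."""
    return peak_flops * mfu / train_flops_per_token(params)


flops_per_token = train_flops_per_token(PARAMS)
print(f"FLOPs per token:        {flops_per_token:.2e}")
print(f"Tokens/s at peak:       {tokens_per_second(PARAMS, V5P_PEAK_FLOPS):.0f}")
print(f"Tokens/s at 40% MFU:    {tokens_per_second(PARAMS, V5P_PEAK_FLOPS, 0.4):.0f}")
```

At peak this is roughly 1,100 tokens per second per chip, and under half of that at a realistic MFU.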
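Breaking down the 6x factor: Where does the "6 × params" rule come from? In the forward pass, each weight takes part in one multiply and one add per token, which is 2 FLOPs per parameter. The backward pass runs two matmuls of the same size for every weight matrix, one for the gradient with respect to the inputs and one for the gradient with respect to the weights, which is 4 FLOPs per parameter. Together that gives 2 + 4 = 6 FLOPs per parameter per token.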
This is an approximation. It ignores attention QK^T and softmax (which add ~4BSNH FLOPs per layer), LayerNorm, and activation functions. For long sequences, the attention FLOPs can be significant. But for the standard regime (seq < 8K, large FFN), the 6P rule is accurate to within ~5%.
Checking the 6x rule against exact computation:
The exact number is about 50% higher than the simple 6P/4 estimate because attention projections add significant FLOPs. The 6P rule works well for total training FLOPs but undercounts when breaking down per-layer costs. For back-of-the-envelope total training time, it is good enough.
Detailed FLOPs breakdown by operation (forward pass only, per token):
| Operation | FLOPs/token/layer | All 80 layers |
|---|---|---|
| FFN (3 projections) | 3 × 2 × 8192 × 28672 = 1.41e9 | 1.13e11 |
| Attn projections (Q,K,V,O) | 2 × 8192 × (2×8192 + 2×1024) = 3.02e8 | 2.41e10 |
| Attn QK^T + softmax×V | 4 × 64 × 4096 × 128 = 1.34e8 | 1.07e10 |
| Total forward | | 1.48e11 |
| Total fwd+bwd (×3) | | 4.44e11 |
The exact count of 4.44e11 vs our 6P estimate of 4.2e11 shows the rule is accurate to within 6%. Good enough for all practical purposes.
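The breakdown table can be reproduced with a small sketch. It follows the table's conventions: attention scores are counted over the full sequence length S = 4096 with no causal-mask discount, and embeddings and normalization are ignored. The function name is an assumption.

```python
def forward_flops_per_token(L, D, F, N, K, H, S):
    """Forward-pass FLOPs per token, split by operation as in the table above."""
    per_layer = {
        "ffn": 3 * 2 * D * F,                          # gate, up, down projections
        "attn_proj": 2 * D * (2 * N * H + 2 * K * H),  # Q, O and K, V projections
        "attn_scores": 4 * N * S * H,                  # QK^T and softmax x V
    }
    return {name: flops * L for name, flops in per_layer.items()}


breakdown = forward_flops_per_token(L=80, D=8192, F=28672, N=64, K=8, H=128, S=4096)
forward_total = sum(breakdown.values())
train_total = 3 * forward_total          # forward + backward is 3x forward
six_p_estimate = 6 * 70e9

for name, flops in breakdown.items():
    print(f"{name:>12}: {flops:.3e}")
print(f"forward total: {forward_total:.3e}")
print(f"fwd+bwd total: {train_total:.3e}")
print(f"ratio to 6P:   {train_total / six_p_estimate:.3f}")
```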
LLaMA 3 was trained for approximately 15 trillion tokens. Total training FLOPs:
That is 6.3 yottaFLOPs. On a single TPU v5p, this would take:
Now let us estimate the training time on a full TPU v5p pod (8960 chips) at 40% MFU:
Let us sanity-check this. Each training step processes 4M tokens. FLOPs per step = 4.2e11 × 4e6 = 1.68e18. Time per step = 1.68e18 / (8960 × 4.59e14 × 0.4) = 1.02 seconds. Number of steps = 15e12 / 4e6 = 3.75M steps. Total time = 3.75e6 × 1.02 = 3.83e6 seconds = 44.3 days. The two approaches agree.
Steps per day: 3.75e6 steps / 44.3 days = ~84,600 steps/day, or about 1 step per second. This is important for monitoring — if your training run suddenly drops to 0.5 steps/second, you know something is wrong (communication bottleneck, hardware failure, or checkpointing overhead).
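Both routes to the training time, total FLOPs and per-step timing, fit in one sketch. All inputs are the figures quoted above (4.2e11 FLOPs per token, 15T tokens, 8960 chips, 459 TFLOPS peak, 40% MFU, 4M tokens per step); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86400


def training_days(total_flops, chips, peak_flops_per_chip, mfu):
    """Wall-clock days to execute total_flops on a cluster at a given MFU."""
    cluster_flops_per_second = chips * peak_flops_per_chip * mfu
    return total_flops / cluster_flops_per_second / SECONDS_PER_DAY


flops_per_token = 4.2e11
total_tokens = 15e12
batch_tokens = 4e6

# Route 1: total FLOPs divided by sustained cluster throughput.
total_flops = flops_per_token * total_tokens
days = training_days(total_flops, chips=8960, peak_flops_per_chip=4.59e14, mfu=0.4)

# Route 2: time per step times number of steps.
step_seconds = flops_per_token * batch_tokens / (8960 * 4.59e14 * 0.4)
num_steps = total_tokens / batch_tokens
days_from_steps = step_seconds * num_steps / SECONDS_PER_DAY

print(f"Total FLOPs:     {total_flops:.2e}")
print(f"Training time:   {days:.1f} days")
print(f"Seconds/step:    {step_seconds:.2f}")
print(f"Steps per day:   {num_steps / days:,.0f}")
```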
Let us also estimate the dollar cost. TPU v5p chips cost approximately $4.20/hour on Google Cloud:
This is a rough estimate. Real costs include fault tolerance overhead (checkpointing, restarts), networking costs, storage, and the engineering team. The actual LLaMA 3 report quotes training on 16,384 H100 GPUs.
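The cost estimate is a one-line calculation, sketched here with the $4.20 per chip-hour figure quoted above. Writing it in terms of chip-hours makes one property visible: at a fixed MFU the chip count cancels out, so the compute bill does not depend on how many chips you spread the work over. The function names are assumptions.

```python
def chip_hours(total_flops, peak_flops_per_chip, mfu):
    """Chip-hours needed for a job; chip count cancels out at fixed MFU."""
    return total_flops / (peak_flops_per_chip * mfu) / 3600


def training_cost_usd(total_flops, peak_flops_per_chip, mfu, usd_per_chip_hour):
    """Compute-only cost: chip-hours times the hourly price per chip."""
    return chip_hours(total_flops, peak_flops_per_chip, mfu) * usd_per_chip_hour


hours = chip_hours(6.3e24, 4.59e14, 0.40)
cost = training_cost_usd(6.3e24, 4.59e14, 0.40, usd_per_chip_hour=4.20)
print(f"Chip-hours: {hours / 1e6:.2f}M")
print(f"Cost:       ${cost / 1e6:.1f}M")
```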
Hidden costs beyond compute:
| Cost component | Multiplier | Notes |
|---|---|---|
| GPU/TPU compute | 1x (base) | The $40M we computed |
| Fault tolerance overhead | +5-15% | Time lost to restarts, checkpointing |
| Data preparation | +5-10% | Tokenization, dedup, quality filtering |
| Evaluation runs | +10-20% | Benchmark suites, ablation studies |
| Infrastructure (networking, storage) | +10-20% | High-speed interconnects, data storage |
| Engineering team | +20-50% | Salaries for 10-50 engineers over 6-12 months |
| Realistic total | 1.5-2.5x | $60-100M for a 70B model |
| Topology | Chips | Training time (40% MFU) | Approximate cost |
|---|---|---|---|
| 1 TPU v5p | 1 | ~1,090 years | $40M |
| 1/4 pod | 2240 | 176 days | $40M |
| Full pod | 8960 | 44 days | $40M |
| 4 pods | 35840 | 11 days | $40M |
Why is time-to-completion so important? In practice, faster training is worth paying a premium for. Reasons:
1. Iteration speed: If an experiment takes 44 days, you get feedback 4x faster than on a quarter-pod (176 days). Over a year of development, this means 4x more experiments.
2. Competitive pressure: In the frontier lab race, shipping a model 3 months earlier can be the difference between leading and lagging.
3. Reliability: Longer training runs are more likely to encounter hardware failures. Meta's LLaMA 3 report mentions that their 16K GPU cluster experienced an interruption roughly every 3 hours on average.
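4. The premium is usually small: At a fixed MFU, total chip-hours (and so dollar cost) do not depend on chip count, as the pod rows in the table above show. MFU tends to drop somewhat on larger clusters because of extra communication, so the faster run costs more, but typically not by much.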
How many chips do we need at minimum? This is a memory question, not a compute question. During training, HBM holds three things:
| Component | Formula | Size (LLaMA 70B) |
|---|---|---|
| Parameters (bf16) | 2 × P | 140 GB |
| Optimizer state (Adam, fp32) | 12 × P (fp32 copy + momentum + variance) | 840 GB |
| Gradient checkpoints | 2 × D × B × nckpt × L | ~21.0 TB (4 checkpoints/layer, 4M token batch) |
| Total | | ~22.0 TB |
Let us derive the gradient checkpoint size. Each checkpoint saves the activation tensor at that point: shape (B, D) in bf16. With 4 checkpoints per layer, that is 2 bytes × 8,192 × 4e6 tokens × 4 × 80 layers ≈ 2.1e13 bytes, or about 21.0 TB.
With 96 GB HBM per TPU v5p chip, minimum chips = 22.0e12 / 96e9 ≈ 229 chips (the rest of this chapter uses 225 as a round figure). That is tiny compared to 8960! We are using those extra chips not because we need the memory but because we need the FLOPs to finish training in reasonable time.
On 8960 chips, memory per chip = 22.0 TB / 8960 ≈ 2.5 GB per chip. We are using only about 2.5% of HBM. Even with 12 checkpoints per layer, we would still only be at ~7 GB per chip.
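The memory budget can be sketched as follows. The snippet assumes 2 bytes per bf16 parameter and 12 bytes per parameter of fp32 Adam state (master copy, momentum, variance), matching the per-parameter breakdown given later in this section; the function name and the `ckpts_per_layer` argument are illustrative.

```python
import math

HBM_BYTES_PER_CHIP = 96e9   # TPU v5p HBM capacity


def training_memory_bytes(params, D, L, batch_tokens, ckpts_per_layer):
    """Training memory by component, in bytes."""
    memory = {
        "weights_bf16": 2 * params,
        "adam_fp32": 12 * params,   # master copy + momentum + variance
        "checkpoints": 2 * D * batch_tokens * ckpts_per_layer * L,
    }
    memory["total"] = sum(memory.values())
    return memory


mem = training_memory_bytes(params=70e9, D=8192, L=80, batch_tokens=4e6, ckpts_per_layer=4)
min_chips = math.ceil(mem["total"] / HBM_BYTES_PER_CHIP)
per_chip_gb = mem["total"] / 8960 / 1e9

for name, value in mem.items():
    print(f"{name:>12}: {value / 1e12:6.2f} TB")
print(f"Minimum chips for memory: {min_chips}")
print(f"Per chip on 8960 chips:   {per_chip_gb:.2f} GB")
```

Changing `ckpts_per_layer` to 1 or 12 shows how strongly the checkpoint term dominates the total.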
How many chips do we need at minimum for each LLaMA 3 model?
| Model | Param memory (bf16) | Optimizer (Adam fp32) | Total (no checkpoints) | Min TPU v5p chips |
|---|---|---|---|---|
| 8B | 16 GB | 96 GB | 112 GB | 2 |
| 70B | 140 GB | 840 GB | 980 GB | 11 |
| 405B | 810 GB | 4860 GB | 5670 GB | 60 |
Even the 405B model only needs 60 chips for the model state alone. The gradient checkpoints (which depend on batch size) can add significantly more, but the point remains: memory is not the binding constraint at production scales.
Why do we use bf16 for parameters and fp32 for optimizer? The parameters and gradients use bf16 (2 bytes each) because the forward and backward passes can tolerate reduced precision. But the Adam optimizer maintains a running average of gradients (momentum m) and squared gradients (variance v), plus a master copy of the weights. These accumulators require fp32 (4 bytes each) to avoid numerical instability — small gradient updates can underflow in bf16. Total per parameter: 4 (master weight) + 4 (m) + 4 (v) = 12 bytes.
What are gradient checkpoints? During the backward pass, we need the activations from the forward pass to compute gradients. Normally we store all activations, but for large models this is prohibitive. Gradient checkpointing (also called "activation recomputation") stores only a few checkpoint activations and recomputes the rest during the backward pass. The trade-off: extra compute (~33% more FLOPs) for much less memory.
With 4 checkpoints per layer, we store 4 activation tensors per layer, each of shape (Bmicro, D) for a given microbatch. The "B" in the formula above is the full batch in tokens: with FSDP each chip processes only its own share of the batch (B divided by the number of chips), but summed over all chips and microbatches the checkpoints cover the full batch.
What if we used only 1 checkpoint per layer?
Even with aggressive checkpointing and microbatching, memory per chip stays well under the 96 GB limit on v5p. This confirms that at these scales, we have enormous memory headroom.
Let us work through the parallelism config for LLaMA 3-70B on a full TPU v5p pod (8960 chips) with a 4M token batch (1024 sequences of length 4096).
Attempt 1: Pure FSDP. Can we shard everything with FSDP alone?
Attempt 2: FSDP + TP. Use the optimal FSDP formula:
Round to 2048-way FSDP, giving TP = 8960 / 2048 ≈ 4-way TP.
Let us verify both dimensions are compute-bound:
In practice, the split is: 1024-way DP (one sequence per DP rank), 2-way sequence/context parallelism (splitting 4096-length sequences in half), and 4-way TP (splitting weight matrices across 4 chips within each node).
Why not more TP? We could do 8-way TP with 1120-way FSDP. The TP limit is Y < F × MY / 2550. With MY = 1 axis: Y < 28672/2550 ≈ 11, so 8-way TP is fine. With 8-way TP, the per-FSDP-group batch would be 468 × 8 = 3744 and the FSDP degree would be 1120. The FSDP threshold becomes: 3744 > 2550/1 = 2550? Yes! So 8-way TP + 1120 FSDP also works. But 4-way TP is simpler and has less TP communication, so it is preferred.
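A short sketch of the sharding arithmetic, using the optimal-FSDP expression √(2BN/F) from the summary table and the TP bound Y < F × MY / threshold. The threshold of 2550 is the constant this paragraph uses for the FSDP check on v5p; applying it to the TP bound as well is an assumption, as are the function names.

```python
import math


def optimal_fsdp_degree(batch_tokens, chips, ffn_dim):
    """X_opt = sqrt(2 * B * N / F), with N the total chip count."""
    return math.sqrt(2 * batch_tokens * chips / ffn_dim)


def max_tp_degree(ffn_dim, tp_axes, threshold=2550):
    """Upper bound on TP degree Y for the TP matmuls to stay compute-bound."""
    return ffn_dim * tp_axes / threshold


batch_tokens = 1024 * 4096     # 1024 sequences of length 4096
chips = 8960
ffn_dim = 28672

x_opt = optimal_fsdp_degree(batch_tokens, chips, ffn_dim)
tp_limit = max_tp_degree(ffn_dim, tp_axes=1)
per_chip_batch = batch_tokens / chips

print(f"Optimal FSDP degree: {x_opt:.0f} (rounded up to 2048 in the text)")
print(f"TP degree limit:     {tp_limit:.1f}")
print(f"Per-chip batch:      {per_chip_batch:.0f} tokens")
print(f"8-way TP group batch: {per_chip_batch * 8:.0f} tokens")
```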
Can we train on fewer chips? Say we only had 2240 chips (a quarter-pod):
On a quarter-pod, we do not even need TP. Pure FSDP suffices because the per-chip batch is large enough. Training time = 44 × 4 = 176 days.
What about 225 chips (the minimum for memory)?
Technically possible. Practically absurd. This illustrates why large clusters exist: not for memory, but for speed.
The "Goldilocks zone" for chip count:
| Chips | Per-chip batch | Parallelism needed | Training time | Status |
|---|---|---|---|---|
| 225 | 17,778 | FSDP only | 4.8 years | Too slow |
| 2240 | 1,786 | FSDP only | 176 days | Slow but feasible |
| 8960 | 446 | FSDP + 4-way TP | 44 days | Production sweet spot |
| 35840 | 112 | PP + FSDP + TP | 11 days | Fast but complex |
Let us make the communication cost concrete. For a single FFN layer of LLaMA 70B (D=8192, F=28672):
Pure FSDP (8960-way):
FSDP (2048-way) + TP (4-way):
Let us also verify the memory picture. With 2048-way FSDP and 4-way TP:
Why so much wasted memory? This is a consequence of using 8960 chips for a problem that only needs 225 chips of memory. The remaining 93 GB per chip (97% of HBM!) sits unused. In theory, we could use this spare memory for:
In practice, frameworks like Megatron-LM and MaxText do use some of this spare memory for communication buffers and prefetching, which helps overlap communication with compute.
Practical training frameworks for LLaMA-scale models:
| Framework | Hardware | Parallelism support | Used by |
|---|---|---|---|
| Megatron-LM | NVIDIA GPU | TP, PP, DP, CP, EP | NVIDIA, Meta (LLaMA) |
| MaxText | TPU | FSDP, TP, PP (via JAX) | Google DeepMind |
| DeepSpeed | GPU | ZeRO-1/2/3, PP, TP | Microsoft |
| FSDP (PyTorch) | GPU | ZeRO-3 style | Meta, community |
| Nanotron | GPU | TP, PP, DP | Hugging Face |
All of these implement the same fundamental parallelism strategies we derived in this chapter. The differences are in engineering: how well they overlap communication with compute, how they handle fault tolerance, and how easy they are to configure. For a new project, the choice is usually determined by your hardware (TPU ⇒ MaxText, NVIDIA GPU ⇒ Megatron-LM or DeepSpeed).
Putting it all together — a complete training recipe for LLaMA 3-70B:
Every number in this recipe was derived from first principles in this chapter. You can repeat this analysis for any model by plugging in the architecture hyperparameters and hardware specs.
What about scaling to 4 pods (35,840 chips)? We would add 4-way pipeline parallelism across pods:
Let us collect the key results for training LLaMA 3-70B:
| Quantity | Value | How derived |
|---|---|---|
| Parameter count | 70.6B | Sum of FFN (56.4B) + Attention (12.1B) + Embeddings (2.1B) |
| FLOPs/token/step | 4.2e11 | 6 × param count |
| Total FLOPs (15T tokens) | 6.3e24 | FLOPs/token × token count |
| Training time (8960 chips, 40% MFU) | 44 days | Total FLOPs / (N × C × MFU) |
| Minimum chips (memory) | ~225-230 | ~22 TB of weights, optimizer state, and checkpoints / 96 GB per chip |
| Parallelism config | 2048 FSDP × 4 TP | Xopt = √(2BN/F) |
| Estimated cost | ~$40M | N × $/hr × hours |
Extension: LLaMA 3-405B. Using the 405B config (D=16384, F=53248, L=126, V=128256):
This would use 8-way PP across pods, ~4096-way FSDP within each pod, and 4-way TP within each node. The PP bubble with 128 microbatches: (8-1)/(128+7) = 5.2%. Total efficiency: 0.35 × (1-0.052) ≈ 33% MFU after bubble.
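The bubble figure follows from the expression (p − 1) / (m + p − 1) for p pipeline stages and m microbatches, which is the form used in the paragraph above. A minimal sketch, with illustrative names:

```python
def pipeline_bubble_fraction(stages, microbatches):
    """Idle fraction of a simple pipeline schedule: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)


bubble = pipeline_bubble_fraction(stages=8, microbatches=128)
effective_mfu = 0.35 * (1 - bubble)

print(f"Bubble fraction:  {bubble:.1%}")
print(f"MFU after bubble: {effective_mfu:.1%}")
```

Doubling the number of microbatches roughly halves the bubble, which is why more microbatches is the first lever to pull.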
Extension: LLaMA 3-8B. Much simpler (D=4096, F=14336, L=32):
Scaling laws and the cost of training: The Chinchilla scaling law suggests training on ~20 tokens per parameter for compute-optimal training. For 70B: 20 × 70B = 1.4T tokens. But LLaMA 3 was trained on 15T tokens — roughly 10x the Chinchilla-optimal amount. Why?
Because the Chinchilla law optimizes for training compute, not inference cost. A smaller model trained on more data can match the quality of a larger compute-optimal model, and the smaller model is cheaper to serve. Since inference cost often exceeds training cost over a model's lifetime, over-training a smaller model is economically rational. This is sometimes called the "inference-aware" or "overtrained" scaling regime.
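The overtraining factor is a two-line calculation, sketched here with the ~20 tokens per parameter rule of thumb quoted above (the function name is an assumption):

```python
def chinchilla_tokens(params, tokens_per_param=20):
    """Approximate compute-optimal token budget under the ~20 tokens/param rule."""
    return tokens_per_param * params


optimal_tokens = chinchilla_tokens(70e9)
overtraining_factor = 15e12 / optimal_tokens

print(f"Chinchilla-optimal tokens: {optimal_tokens:.2e}")
print(f"Overtraining factor:       {overtraining_factor:.1f}x")
```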
Comparing training costs across model sizes:
| Model | Params | FLOPs (15T tokens) | v5p chips | MFU | Time | Cost |
|---|---|---|---|---|---|---|
| LLaMA 3-8B | 8B | 7.2e23 | 2,240 | 45% | ~18 days | ~$4.1M |
| LLaMA 3-70B | 70B | 6.3e24 | 8,960 | 40% | ~44 days | ~$40M |
| LLaMA 3-405B | 405B | 3.6e25 | 71,680 | 35% | ~36 days | ~$260M |
The 405B model costs roughly 6.5x more than the 70B despite having only 5.8x more parameters, because MFU tends to decrease at larger scales (more communication overhead, pipeline bubbles, etc.).
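The comparison table can be regenerated from its inputs. This sketch uses round parameter counts (8e9, 70e9, 405e9), 15T tokens, the 459 TFLOPS v5p peak, and $4.20 per chip-hour as quoted earlier; the chip counts and MFU values are the table's own assumptions. Under these inputs the 8B row comes out near 18 days and $4.1M.

```python
V5P_PEAK_FLOPS = 4.59e14
USD_PER_CHIP_HOUR = 4.20
TOKENS = 15e12

# model name -> (parameters, chips, assumed MFU), taken from the table above
MODELS = {
    "LLaMA 3-8B": (8e9, 2240, 0.45),
    "LLaMA 3-70B": (70e9, 8960, 0.40),
    "LLaMA 3-405B": (405e9, 71680, 0.35),
}


def estimate(params, chips, mfu):
    """Return (total FLOPs, training days, compute cost in USD)."""
    total_flops = 6 * params * TOKENS
    seconds = total_flops / (chips * V5P_PEAK_FLOPS * mfu)
    cost = chips * seconds / 3600 * USD_PER_CHIP_HOUR
    return total_flops, seconds / 86400, cost


results = {name: estimate(*config) for name, config in MODELS.items()}
for name, (flops, days, cost) in results.items():
    print(f"{name:>13}: {flops:.1e} FLOPs, {days:5.1f} days, ${cost / 1e6:6.1f}M")
```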
What limits MFU at scale?
| Overhead source | Typical cost | How to minimize |
|---|---|---|
| FSDP AllGather/ReduceScatter | 5-15% | Overlap with compute, use TP to reduce volume |
| TP ReduceScatter | 3-8% | Keep TP degree low, use fast intra-node links |
| PP bubble | 3-10% | More microbatches, interleaved schedules |
| Activation recomputation | ~33% | Cannot avoid — fundamental trade-off |
| Data loading / preprocessing | 1-5% | Prefetch data in background |
| Checkpointing overhead | 1-3% | Async checkpointing |
| Non-matmul ops (LayerNorm, etc.) | 2-5% | Fused kernels |
Activation recomputation is the single largest overhead. Since we recompute all activations during the backward pass, we effectively run the forward pass twice (once forward, then again during backward), so total compute rises from 6P to 8P FLOPs per token, about 1.33x. This alone limits theoretical max MFU to ~75%. Add communication and other overheads, and 40-50% is genuinely good.
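The ~75% ceiling follows from counting FLOPs per parameter per token: 2 for the forward pass, 4 for the backward pass, and up to 2 more for the recomputed forward pass. A sketch, where `recomputed_share` is a hypothetical knob for the share of the forward pass that gets recomputed:

```python
def recompute_mfu_ceiling(recomputed_share=1.0):
    """Upper bound on MFU when a share of the forward pass is recomputed."""
    useful_flops = 2.0 + 4.0                 # forward + backward, per parameter per token
    extra_flops = 2.0 * recomputed_share     # repeated forward work, not counted as useful
    return useful_flops / (useful_flops + extra_flops)


print(f"Full recomputation: {recompute_mfu_ceiling(1.0):.0%}")
print(f"Half recomputation: {recompute_mfu_ceiling(0.5):.0%}")
print(f"No recomputation:   {recompute_mfu_ceiling(0.0):.0%}")
```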
What about training on H100 GPUs instead? Meta's actual LLaMA 3 training used 16,384 H100 GPUs. Let us estimate:
Comparable to our TPU estimate. The training time is shorter (11 vs 44 days) because the H100 cluster has nearly four times the total peak FLOPs/s (1.62e19 vs 4.11e18), but the dollar cost is similar because H100s cost more per chip-hour.
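The H100 time estimate can be sketched the same way as the TPU one. It assumes a per-GPU bf16 peak of 9.89e14 FLOPs/s, which matches the 1.62e19 cluster total in the comparison table, and the same 40% MFU. No hourly price is hard-coded here, since the text does not state one.

```python
H100_PEAK_FLOPS = 9.89e14      # assumed bf16 dense peak per GPU
V5P_PEAK_FLOPS = 4.59e14
TOTAL_FLOPS = 6.3e24
MFU = 0.40


def days_and_chip_hours(total_flops, chips, peak_flops_per_chip, mfu):
    """Return (wall-clock days, total chip-hours) for a training run."""
    seconds = total_flops / (chips * peak_flops_per_chip * mfu)
    return seconds / 86400, chips * seconds / 3600


h100_days, h100_chip_hours = days_and_chip_hours(TOTAL_FLOPS, 16384, H100_PEAK_FLOPS, MFU)
tpu_days, tpu_chip_hours = days_and_chip_hours(TOTAL_FLOPS, 8960, V5P_PEAK_FLOPS, MFU)
peak_ratio = (16384 * H100_PEAK_FLOPS) / (8960 * V5P_PEAK_FLOPS)

print(f"H100: {h100_days:.1f} days, {h100_chip_hours / 1e6:.2f}M GPU-hours")
print(f"v5p:  {tpu_days:.1f} days, {tpu_chip_hours / 1e6:.2f}M chip-hours")
print(f"Cluster peak ratio (H100 / v5p): {peak_ratio:.2f}x")
```

Multiplying the GPU-hours by whatever hourly rate you actually pay gives the compute cost.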
Efficiency comparison:
| Metric | 8960 TPU v5p | 16384 H100 |
|---|---|---|
| Total peak FLOPs/s | 4.11e18 | 1.62e19 |
| Training time (40% MFU) | 44 days | 11 days |
| Total chip-hours | 9.44M | 4.44M |
| Estimated cost | ~$40M | ~$48M |