Open-weight foundation models at 8B, 70B, and 405B parameters. 15 trillion training tokens, grouped query attention, SwiGLU, RoPE — and a post-training recipe that turns base models into capable assistants.
In 2024, the most capable language models — GPT-4, Claude 3, Gemini Ultra — are proprietary. You send your data to an API, pay per token, and hope the provider doesn't change the model, raise prices, or discontinue it. You can't inspect the weights, can't fine-tune on your private data, can't run it on your own hardware, and can't verify what it's doing with your inputs.
Meta's Llama 3 challenges this paradigm. By releasing open weights at 8B, 70B, and 405B parameters, Meta made a bet: the best way to advance AI is to let everyone experiment, fine-tune, and build on top of the same foundation. The 405B model is particularly significant — it's the first open model that approaches the performance of GPT-4 and Claude 3.5 on major benchmarks.
| Model | Params | Context | Open? | MMLU (5-shot) |
|---|---|---|---|---|
| Llama 3 8B | 8B | 128K | Yes | 68.4 |
| Llama 3 70B | 70B | 128K | Yes | 82.0 |
| Llama 3 405B | 405B | 128K | Yes | 87.3 |
| GPT-4 (0125) | ~1.8T (est.) | 128K | No | 86.4 |
| Claude 3.5 Sonnet | Unknown | 200K | No | 88.7 |
The Llama 3 paper is unusual: it's not just a model release — it's a 92-page technical report that documents the entire pipeline from data curation to post-training to safety evaluation. It's essentially a textbook on how to train a modern large language model at scale. We'll work through the key components.
Compare Llama 3 against other models. The x-axis shows parameter count (log scale), the y-axis shows MMLU performance. Open models are shown in teal; closed models in orange. Notice how Llama 3 405B closes the gap with proprietary models.
| Version | Date | Sizes | Training tokens | Key change |
|---|---|---|---|---|
| Llama 1 | Feb 2023 | 7B-65B | 1.4T | First open competitive model |
| Llama 2 | Jul 2023 | 7B-70B | 2T | RLHF, chat-tuned, 4K context |
| Llama 3 | Jul 2024 | 8B-405B | 15T | 128K context, 405B scale, multimodal |
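The jump from Llama 2 to Llama 3 is dramatic: 7.5x more training tokens (2T → 15T), architectural refinements (GQA at every model size and a 128K-token vocabulary, on top of the SwiGLU and RoPE choices carried over from earlier Llama models), a massive scale-up to 405B parameters, and a comprehensive post-training pipeline. Let's understand each piece.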
Llama 3 uses a standard decoder-only Transformer — the same fundamental architecture as GPT. But every component has been carefully optimized. Meta's philosophy: "Our design philosophy is to use a relatively standard Transformer architecture with minor modifications, rather than developing a bespoke architecture."
Here are the specific architecture choices, each with the engineering rationale:
Standard multi-head attention (MHA) gives each head its own query, key, and value projections. With 128 heads in the 405B model, that's 128 separate key and value heads per layer, which is expensive during inference because the keys and values of every previous token must be cached for every head.
Grouped Query Attention (Ainslie et al., 2023) shares key-value heads across groups of query heads. Instead of 128 KV heads, Llama 3 uses only 8 KV groups — each group of 16 query heads shares the same keys and values.
```python
# Grouped Query Attention: 8 KV heads shared across 128 query heads
import torch
import torch.nn as nn
import torch.nn.functional as F


class GQA(nn.Module):
    def __init__(self, d_model=16384, n_heads=128, n_kv_heads=8):
        super().__init__()
        self.head_dim = d_model // n_heads   # 128
        self.n_heads = n_heads               # 128 query heads
        self.n_kv_heads = n_kv_heads         # 8 KV groups
        self.n_rep = n_heads // n_kv_heads   # 16 queries per KV group
        self.wq = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim)
        # Repeat KV heads to match query heads
        k = k.repeat_interleave(self.n_rep, dim=2)  # [B,T,128,128]
        v = v.repeat_interleave(self.n_rep, dim=2)  # [B,T,128,128]
        # Standard causal attention from here (RoPE omitted for brevity)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # [B,heads,T,head_dim]
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)       # merge heads again
        return self.wo(out)
```
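To make the inference saving concrete, the KV cache size can be worked out by hand. The sketch below is a back-of-the-envelope calculation, not a measurement: it uses the 405B shape quoted in this article (126 layers, head dimension 128) and assumes 2-byte (fp16/bf16) cache entries. The helper name `kv_cache_bytes_per_token` is made up for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes cached per token: one K and one V vector per KV head per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# 405B shape: 126 layers, head_dim 128 (2-byte cache entries are an assumption)
mha = kv_cache_bytes_per_token(126, n_kv_heads=128, head_dim=128)  # hypothetical MHA variant
gqa = kv_cache_bytes_per_token(126, n_kv_heads=8, head_dim=128)    # GQA with 8 KV heads

context = 128 * 1024  # one full 128K-token sequence
print(f"MHA: {mha * context / 2**30:.0f} GiB per sequence")
print(f"GQA: {gqa * context / 2**30:.0f} GiB per sequence")
print(f"Reduction: {mha // gqa}x")
```

Under these assumptions the cache shrinks by the same factor of 16 as the number of KV heads, which is the difference between a cache that fits on one node and one that does not.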
The original Transformer uses a two-layer FFN with ReLU: FFN(x) = ReLU(xW1)W2. Llama 3 replaces this with SwiGLU (Shazeer 2020), which uses a gated linear unit with the SiLU (Swish) activation:

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \big(\mathrm{SiLU}(x W_{\mathrm{gate}}) \odot x W_1\big)\, W_2$$

Where ⊙ is element-wise multiplication and SiLU(x) = x · σ(x) is the Sigmoid Linear Unit. This introduces a third weight matrix (Wgate) but consistently improves quality. The gating mechanism lets the network learn which dimensions to activate, which is more expressive than a simple ReLU.
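The gated FFN described above maps directly onto three bias-free linear layers. The following is a minimal sketch rather than Meta's implementation; the attribute names `w_gate`, `w1`, and `w2` mirror the symbols in the text, and the small demo sizes are chosen only so the example runs quickly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """Gated feed-forward block: W2( SiLU(x W_gate) * (x W1) )."""

    def __init__(self, d_model=4096, d_ff=14336):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w1 = nn.Linear(d_model, d_ff, bias=False)      # up projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)      # down projection

    def forward(self, x):
        # Element-wise product of the SiLU-activated gate and the up projection
        return self.w2(F.silu(self.w_gate(x)) * self.w1(x))


ffn = SwiGLUFFN(d_model=64, d_ff=256)  # toy sizes for the demo
y = ffn(torch.randn(2, 10, 64))
print(y.shape)
```

Note the cost of the gate: the block holds three `d_model × d_ff` matrices instead of two, which is why SwiGLU models usually pick a smaller `d_ff` than a ReLU FFN of comparable size would.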
Instead of learned absolute position embeddings (like BERT) or sinusoidal embeddings (like the original Transformer), Llama 3 uses RoPE (Su et al., 2021). RoPE encodes position by rotating the query and key vectors in 2D subspaces:

$$\mathrm{RoPE}(x, p) = x \odot \cos(p\,\theta) + \mathrm{rotate}(x) \odot \sin(p\,\theta)$$

Where p is the token position, θ is a frequency vector (one frequency per 2D subspace), and rotate() maps each coordinate pair (x1, x2) to (−x2, x1). The key property: the attention score between positions i and j depends only on their relative distance (i - j), not their absolute positions. This relative encoding is what makes long-context extension practical, although in practice the rotation frequencies are adjusted and the model is trained further before it handles sequences much longer than those seen initially.
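The relative-position property can be checked numerically. The sketch below uses the common "rotate half" layout, in which coordinate i is paired with coordinate i + d/2; that pairing and the helper names are assumptions of this example. The base of 500,000 is the value reported for Llama 3 but does not appear in the text above, so treat it as an assumption too.

```python
import torch


def rotate_half(x):
    """Map each pair (x1, x2) to (-x2, x1); pairs are (x[..., i], x[..., i + d/2])."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(x, positions, base=500_000.0):
    """Rotate x of shape [T, head_dim] by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float64) / d)  # [d/2] frequencies
    angles = positions.to(torch.float64)[:, None] * theta[None, :]     # [T, d/2]
    angles = torch.cat((angles, angles), dim=-1)                       # [T, d]
    return x * angles.cos() + rotate_half(x) * angles.sin()


torch.manual_seed(0)
q = torch.randn(1, 128, dtype=torch.float64)
k = torch.randn(1, 128, dtype=torch.float64)


def score(i, j):
    """Attention logit between a query at position i and a key at position j."""
    qi = apply_rope(q, torch.tensor([i]))
    kj = apply_rope(k, torch.tensor([j]))
    return (qi * kj).sum().item()


# Same offset (i - j = 3) at very different absolute positions
print(score(5, 2), score(1005, 1002))
```

The two printed scores agree up to floating-point error, because rotating both vectors by a common extra angle leaves their dot product unchanged.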
| Component | Choice | Rationale |
|---|---|---|
| Normalization | RMSNorm (pre-norm) | Faster than LayerNorm (no mean computation), pre-norm is more stable for deep networks |
| Vocabulary | 128K tokens (BPE via tiktoken) | Larger vocab = fewer tokens per text = more efficient processing |
| Context length | 128K tokens | Extended from 8K via RoPE frequency scaling in later training stages |
| Attention bias | None (no bias in QKV projections) | Reduces parameters, no quality loss |
| Embedding tying | No (separate input/output embeddings) | Separate embeddings improve quality at 405B scale |
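
The normalization row is easy to illustrate. Below is a minimal RMSNorm sketch, assuming an epsilon of 1e-5 (a typical default, not a value taken from the text): it divides each vector by its root-mean-square and applies a learned gain, with no mean subtraction and no bias.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Scale each vector by its root-mean-square; no mean subtraction, no bias."""

    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-dimension gain

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)


norm = RMSNorm(8)
x = torch.randn(4, 8) * 10
out = norm(x)
print(out.pow(2).mean(dim=-1))  # mean square of each row is close to 1
```

In a pre-norm block the normalization is applied to the input of each sublayer, roughly `x = x + attention(norm(x))`, so the residual stream itself is never rescaled.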
The Llama 3 Transformer block combines four components: GQA attention, SwiGLU FFN, RMSNorm, and RoPE. The 405B model stacks 126 of these blocks.
| Parameter | 8B | 70B | 405B |
|---|---|---|---|
| Layers | 32 | 80 | 126 |
| Hidden dim | 4096 | 8192 | 16384 |
| Query heads | 32 | 64 | 128 |
| KV heads | 8 | 8 | 8 |
| FFN dim | 14336 | 28672 | 53248 |
| Head dim | 128 | 128 | 128 |
| Vocab | 128K | 128K | 128K |
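
As a sanity check, the configuration table should roughly reproduce the advertised parameter counts. The sketch below adds up the weight matrices of the architecture described above (GQA projections, three SwiGLU matrices, two RMSNorm gains per block, untied embeddings). The exact vocabulary size of 128,256 is an assumption; the table only says 128K.

```python
def llama3_param_count(n_layers, d_model, n_heads, n_kv_heads, d_ff,
                       vocab_size=128_256, head_dim=128):
    """Approximate parameter count for the dense decoder described above."""
    attn = (
        d_model * n_heads * head_dim             # W_q
        + 2 * d_model * n_kv_heads * head_dim    # W_k and W_v (shared across query groups)
        + n_heads * head_dim * d_model           # W_o
    )
    ffn = 3 * d_model * d_ff                     # gate, up, and down projections
    norms = 2 * d_model                          # two RMSNorm gain vectors per block
    embeddings = 2 * vocab_size * d_model        # untied input and output embeddings
    return n_layers * (attn + ffn + norms) + embeddings + d_model  # + final norm


configs = {
    "8B":   dict(n_layers=32,  d_model=4096,  n_heads=32,  n_kv_heads=8, d_ff=14336),
    "70B":  dict(n_layers=80,  d_model=8192,  n_heads=64,  n_kv_heads=8, d_ff=28672),
    "405B": dict(n_layers=126, d_model=16384, n_heads=128, n_kv_heads=8, d_ff=53248),
}
counts = {name: llama3_param_count(**cfg) for name, cfg in configs.items()}
for name, n in counts.items():
    print(f"{name}: {n / 1e9:.1f}B parameters")
```

The totals land within about one percent of the nominal sizes, and the breakdown shows that the FFN matrices dominate: attention accounts for less than a fifth of each block.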
Llama 3 was trained on approximately 15 trillion tokens, roughly 7.5x more than Llama 2 (2T) and about 50x more than GPT-3 (roughly 300B). If the deduplicated corpus holds ~10T unique tokens, the model saw each token roughly 1.5 times on average (15T tokens / ~10T unique tokens). The data pipeline that produces these tokens is arguably as important as the model architecture itself.
| Source | Share | Tokens | Description |
|---|---|---|---|
| Web crawl | ~85% | ~12.7T | Filtered CommonCrawl — the backbone. Multiple rounds of dedup and quality filtering. |
| Code | ~8% | ~1.2T | GitHub, StackOverflow, code-heavy web pages. 10+ programming languages. |
| Books | ~3% | ~0.45T | Curated book collections, Project Gutenberg, academic texts. |
| Academic | ~2.5% | ~0.38T | arXiv, PubMed, academic web pages, scientific papers. |
| Wikipedia + encyclopedias | ~1.5% | ~0.23T | Wikipedia in 30+ languages, encyclopedic sources. |
Raw CommonCrawl is noisy — full of spam, duplicated content, toxic text, and low-quality pages. Llama 3's data pipeline applies multiple cleaning stages:
Watch how raw web crawl data is progressively filtered into high-quality training tokens. Each stage removes a portion of the data. The funnel shows approximate data volumes at each stage.
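To give a feel for what one cleaning stage looks like, here is a toy sketch of exact deduplication plus a crude quality heuristic. It is not Meta's pipeline: the thresholds, the heuristic, and the function names are all invented for illustration, and a production system would add fuzzy deduplication and model-based quality and safety classifiers.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def looks_low_quality(text, min_words=5, max_repeat_ratio=0.5):
    """Toy heuristic: too short, or dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_ratio


def clean_corpus(documents):
    """Drop exact duplicates (after normalization) and low-quality documents."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or looks_low_quality(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


raw = [
    "The Transformer architecture relies on attention layers.",
    "the transformer   architecture relies on attention layers.",  # duplicate after normalization
    "buy buy buy buy buy buy now",                                  # spam-like repetition
    "Too short.",
]
cleaned = clean_corpus(raw)
print(cleaned)
```

Only the first document survives: the second is a normalized duplicate, the third is dominated by one word, and the fourth is too short.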
Not all data is created equal. Meta uses data annealing — changing the data mixture during training:
| Training Phase | Data Mix | Rationale |
|---|---|---|
| Main training (first ~14T) | 85% web, 8% code, 3% books, ... | Broad knowledge acquisition |
| Annealing (final ~1T) | Upweight high-quality sources | Polish quality on the best data |
| Long-context extension | Extend to 128K with long documents | Teach long-range dependencies |
The annealing phase is particularly clever: in the final stages of training, the learning rate is reduced to near-zero and the data mixture is shifted to emphasize the highest-quality documents. This is analogous to "polishing" a lens — the final refinement that sharpens quality without the risk of large updates destabilizing the model.
Llama 3 quadrupled the vocabulary from 32K (Llama 2) to 128K tokens. Each piece of text is therefore encoded in fewer tokens, which brings several benefits:
| Benefit | Mechanism | Impact |
|---|---|---|
| Faster inference | Fewer tokens to generate | ~15% fewer tokens for English text |
| More information per step | Each token carries more meaning | Better for code, math, multilingual |
| Better multilingual | Non-English scripts get more dedicated vocabulary entries | Significantly fewer tokens for CJK, Arabic, etc. |
Training a 405B parameter model on 15T tokens is one of the largest compute operations ever undertaken. Let's understand the engineering required.
Llama 3 405B was trained on 16,384 NVIDIA H100 GPUs, organized into clusters connected by high-bandwidth networking:
| Component | Specification |
|---|---|
| GPUs | 16,384 × NVIDIA H100 80GB |
| GPU memory | ~1.3 PB total (16K × 80 GB) |
| Interconnect (intra-node) | NVLink 4.0, 900 GB/s per GPU |
| Interconnect (inter-node) | 400 Gbps per GPU (RoCE Ethernet fabric for the 405B run; InfiniBand for the smaller models) |
| Total compute | ~$3.8 \times 10^{25}$ FLOPs |
| Training time | ~54 days for 405B |
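
The total-compute figure can be cross-checked with the common rule of thumb of about 6 FLOPs per parameter per training token. This is an approximation, not Meta's accounting, and the per-GPU throughput it implies is an idealized number that assumes all 16,384 GPUs ran for the full 54 days without downtime.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


flops = training_flops(405e9, 15e12)
print(f"Estimated compute: {flops:.2e} FLOPs")

# Sustained per-GPU throughput implied by the table (16,384 GPUs, ~54 days)
gpu_seconds = 16_384 * 54 * 24 * 3600
tflops_per_gpu = flops / gpu_seconds / 1e12
print(f"Implied throughput: {tflops_per_gpu:.0f} TFLOP/s per GPU")
```

The estimate comes out within a few percent of the table's figure, which suggests the quoted number was derived in a similar way.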
A 405B model with fp16 weights requires ~810 GB just for the parameters. A single H100 has 80 GB of memory. You can't fit even the weights on one GPU, let alone the optimizer states and activations. The solution: split the model across thousands of GPUs using multiple parallelism strategies simultaneously.
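The memory arithmetic is worth spelling out. The sketch below assumes a typical mixed-precision Adam layout of 16 bytes per parameter (2-byte weights and gradients, 4-byte master weights, and two 4-byte optimizer moments); the actual layout used for Llama 3 may differ, and activations are ignored entirely.

```python
import math

N_PARAMS = 405e9
GPU_MEMORY_GB = 80

weights_gb = N_PARAMS * 2 / 1e9  # 2 bytes per fp16/bf16 parameter
# Assumed mixed-precision Adam layout: 2 B weights + 2 B grads
# + 4 B fp32 master weights + 4 B + 4 B for the two Adam moments
train_state_gb = N_PARAMS * 16 / 1e9

min_gpus_weights = math.ceil(weights_gb / GPU_MEMORY_GB)
min_gpus_training = math.ceil(train_state_gb / GPU_MEMORY_GB)
print(f"Weights only:   {weights_gb:.0f} GB -> at least {min_gpus_weights} GPUs")
print(f"Training state: {train_state_gb:.0f} GB -> at least {min_gpus_training} GPUs")
```

Even before activations, the training state alone spans dozens of GPUs, so sharding the model is a hard requirement rather than an optimization.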
The 16,384 GPUs are organized along three axes. Tensor parallelism splits each layer within a node (8 GPUs). Pipeline parallelism chains nodes into stages. Data parallelism replicates the full pipeline.
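One way to picture the three axes is as coordinates assigned to every GPU. The sketch below assumes degrees of 8 (tensor), 16 (pipeline), and 128 (data), which multiply to 16,384; only the tensor-parallel degree of 8 is stated in the text, so the other two values and the rank ordering are illustrative assumptions.

```python
TP, PP, DP = 8, 16, 128  # assumed degrees; TP * PP * DP must equal the GPU count
assert TP * PP * DP == 16_384


def rank_to_coords(rank):
    """Map a global GPU rank to (data-parallel replica, pipeline stage, tensor shard)."""
    tp = rank % TP            # fastest-varying: GPUs inside one 8-GPU node
    pp = (rank // TP) % PP    # next: which pipeline stage within the replica
    dp = rank // (TP * PP)    # slowest: which full copy of the model
    return dp, pp, tp


print(rank_to_coords(0))       # (0, 0, 0)
print(rank_to_coords(7))       # (0, 0, 7): last GPU of the first node
print(rank_to_coords(8))       # (0, 1, 0): first GPU of the next pipeline stage
print(rank_to_coords(16_383))  # (127, 15, 7)
```

Placing tensor parallelism on the fastest-varying axis keeps its heavy, per-layer communication on NVLink inside a node, while the less chatty pipeline and data axes cross the slower inter-node network.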
At this scale, hardware failures are the norm, not the exception. With 16K GPUs running for 54 days:
| Challenge | Frequency | Mitigation |
|---|---|---|
| GPU failure | ~Daily | Automated detection + restart from checkpoint |
| Network issue | Several per day | Redundant paths, graceful reconnection |
| Loss spike | ~3 during 405B training | Roll back to earlier checkpoint, skip problematic data |
| NaN in gradients | Rare but catastrophic | Gradient clipping, loss scaling, checkpoint recovery |
Meta reports that during a 54-day snapshot of the 405B training run, they encountered 466 job interruptions. Most of these were unexpected, and about 78% of the unexpected interruptions were attributed to hardware issues. Their automated recovery system was able to restart from the latest checkpoint within minutes, keeping effective training time above 90%.
How do you decide whether to train an 8B, 70B, or 405B model? How many tokens should each see? How do you set the learning rate and batch size? Meta used scaling laws — empirical relationships between model size, data size, and compute — to answer these questions before committing to the full training runs.
Meta trained hundreds of small models (40M to 16B parameters) on varying amounts of data and measured their validation loss. They then fit power-law curves to predict the performance of larger models:

$$L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + L_{\infty}$$

Where N is the number of parameters, D is the number of training tokens, A, B, α, β are fitted constants, and L∞ is the irreducible loss (the loss that remains even with unlimited data and model size). This equation has three terms:
| Term | Meaning |
|---|---|
| $A / N^{\alpha}$ | Loss reduction from model size: bigger models memorize more patterns |
| $B / D^{\beta}$ | Loss reduction from data: more data prevents overfitting |
| $L_{\infty}$ | Irreducible loss: the entropy of natural language itself |
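
Given a fitted law of this form, the compute-optimal split between parameters and tokens has a closed form under the constraint C ≈ 6·N·D. The sketch below shows that calculation. The constants are placeholders (close to the fit published in the Chinchilla paper), not Meta's fitted values, so the resulting numbers are illustrative only.

```python
def loss(N, D, A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """Parametric loss L(N, D) = A / N^alpha + B / D^beta + L_inf (placeholder constants)."""
    return A / N**alpha + B / D**beta + L_inf


def compute_optimal(C, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Minimize loss(N, D) subject to the compute constraint C = 6 * N * D."""
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N_opt = G * (C / 6) ** (beta / (alpha + beta))
    D_opt = (C / 6) / N_opt
    return N_opt, D_opt


C = 3.8e25  # compute budget in FLOPs
N_opt, D_opt = compute_optimal(C)
print(f"N_opt = {N_opt:.3g} parameters, D_opt = {D_opt:.3g} tokens")
print(f"Predicted loss at the optimum: {loss(N_opt, D_opt):.3f}")
```

The closed form follows from substituting D = C / (6N) into the loss and setting the derivative with respect to N to zero; shifting compute toward either a bigger model or more data from this point raises the predicted loss.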
Validation loss decreases with both model size and training tokens. The curves follow power-law scaling: each doubling of model size or data gives diminishing but consistent returns. For a given compute budget, there is an optimal allocation between model size and data.
| Phase | Learning Rate | Schedule |
|---|---|---|
| Warmup | 0 → $8 \times 10^{-5}$ | Linear, 8000 steps |
| Main | $8 \times 10^{-5}$ → $8 \times 10^{-7}$ | Cosine decay over ~15T tokens |
| Annealing | $8 \times 10^{-7}$ → 0 | Linear decay over final ~1T tokens |
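
The warmup and cosine phases of this schedule can be written as a small function. This is a sketch of the general shape only: the total step count is a hypothetical value chosen for the demo, and the final linear annealing to zero is left out.

```python
import math

PEAK_LR = 8e-5
MIN_LR = 8e-7
WARMUP_STEPS = 8_000


def learning_rate(step, total_steps):
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


TOTAL_STEPS = 1_000_000  # hypothetical step count, chosen only for illustration
for step in (0, 4_000, 8_000, 504_000, 1_000_000):
    print(step, f"{learning_rate(step, TOTAL_STEPS):.2e}")
```

The cosine shape spends most of training at a moderately high learning rate and only approaches the floor near the end, which is where the annealing phase takes over.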
Pre-training produces a powerful but "raw" model — it can complete text fluently but doesn't know how to follow instructions, refuse harmful requests, or engage in helpful dialogue. Post-training transforms this base model into an aligned assistant through multiple stages of fine-tuning.
SFT uses carefully curated prompt-response pairs. Meta emphasizes quality over quantity: they found that 27,540 high-quality examples outperformed 10x more lower-quality examples. The SFT data covers:
| Category | Examples | Purpose |
|---|---|---|
| General helpfulness | ~10K | Answer questions, follow instructions, explain concepts |
| Safety | ~5K | Refuse harmful requests, provide safety context |
| Code | ~5K | Write, debug, explain code |
| Math/reasoning | ~4K | Step-by-step problem solving |
| Multilingual | ~3K | Follow instructions in non-English languages |
The reward model is initialized from the pre-trained base model and trained to predict which of two responses a human annotator would prefer. Given a prompt x and two responses yw (preferred) and yl (rejected):

$$\mathcal{L}_{\mathrm{RM}} = -\log \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

Where r(x, y) is the scalar reward score and σ is the sigmoid function. This is the Bradley-Terry model of pairwise preferences, closely related to the model behind chess Elo ratings.
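In code, the pairwise objective is a one-liner on top of the reward scores. The sketch below assumes the rewards have already been computed by some reward model; the function name and the example numbers are made up for illustration.

```python
import torch
import torch.nn.functional as F


def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Scalar rewards for three preference pairs (made-up numbers)
r_w = torch.tensor([2.0, 0.5, -1.0])  # rewards of the preferred responses
r_l = torch.tensor([0.0, 0.5, 1.0])   # rewards of the rejected responses
pair_loss = reward_model_loss(r_w, r_l)
print(pair_loss)
```

The loss depends only on the reward difference: a pair scored identically contributes log 2, and the loss falls toward zero as the preferred response is scored increasingly higher than the rejected one.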
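Instead of optimizing the policy against the reward model with PPO (as in classic RLHF), Llama 3 primarily uses DPO (Rafailov et al., 2023), which directly optimizes the policy using preference data:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$$

Here πθ is the policy being trained, πref is a frozen reference model (typically the SFT checkpoint), and β controls how far the policy may drift from the reference. DPO avoids the instability of PPO training and is simpler to implement. The key idea: instead of fitting a reward model and then optimizing against it, DPO uses the language model itself as an implicit reward model.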
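The DPO objective needs only four numbers per preference pair: the summed log-probabilities of the preferred and rejected responses under the policy and under a frozen reference model. The sketch below follows the standard formulation from the DPO paper; the argument names and the default β of 0.1 are assumptions of this example rather than details taken from the text.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed log-probs of chosen (w) and rejected (l) responses."""
    chosen_logratio = policy_logp_w - ref_logp_w    # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()


# Policy identical to the reference: no preference learned yet
same = torch.tensor([-7.0])
print(dpo_loss(same, same, same, same))

# Policy raises the chosen response and lowers the rejected one
print(dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]), same, same))
```

The term β times the log-ratio plays the role of the implicit reward, so the expression has the same Bradley-Terry shape as a reward model loss, with no separate reward network in the optimization loop.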
Post-training is iterative. Each round consists of SFT, reward model training, and DPO. The model quality improves with each round as better models generate better training data.
In addition to DPO, Meta uses rejection sampling: generate K responses from the current model, score each with the reward model, and keep only the best one for SFT. This is simpler than DPO and provides high-quality training examples for the next round of SFT.
```python
# Rejection sampling: generate K responses, keep the best
def rejection_sample(model, reward_model, prompt, K=16):
    responses = [model.generate(prompt, temperature=0.7) for _ in range(K)]
    scores = [reward_model.score(prompt, r) for r in responses]
    best_idx = max(range(K), key=lambda i: scores[i])
    return responses[best_idx]  # highest-scoring response

# This is compute-intensive but produces very high quality
# training examples for the next SFT round
```
The Llama 3 paper describes experiments extending the language model to handle images, video, and speech — turning it into a multimodal model. These extensions follow a common pattern: keep the pre-trained language model mostly frozen and train adapter modules that project other modalities into the language model's embedding space.
The vision extension uses a pre-trained image encoder (ViT-H) connected to the language model through trained adapter layers.
The cross-attention adapter is key: instead of simply projecting image features through an MLP (as in LLaVA), Llama 3 uses cross-attention layers inserted between Transformer blocks. The image tokens serve as keys and values, while the text tokens serve as queries. This gives the language model fine-grained access to spatial information in the image.
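A cross-attention adapter of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the architecture from the report: the dimensions are toy values, LayerNorm stands in for the model's normalization, and the zero-initialized tanh gate (borrowed from Flamingo-style adapters) is an assumption that makes the adapter a no-op at initialization.

```python
import torch
import torch.nn as nn


class CrossAttentionAdapter(nn.Module):
    """Text hidden states attend to image features through a gated residual."""

    def __init__(self, d_text=512, d_image=256, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_text)
        self.attn = nn.MultiheadAttention(
            embed_dim=d_text, num_heads=n_heads,
            kdim=d_image, vdim=d_image, batch_first=True,
        )
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: starts as a no-op

    def forward(self, text_hidden, image_feats):
        # Queries come from text tokens; keys and values come from image patches
        attended, _ = self.attn(self.norm(text_hidden), image_feats, image_feats)
        return text_hidden + torch.tanh(self.gate) * attended


adapter = CrossAttentionAdapter()
text = torch.randn(2, 16, 512)    # [batch, text tokens, d_text]
image = torch.randn(2, 196, 256)  # [batch, image patches, d_image]
out = adapter(text, image)
print(out.shape)
```

Because the gate starts at zero, inserting the adapter leaves the language model's outputs unchanged until training opens the gate, which fits the goal of keeping the pre-trained backbone mostly frozen.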
Video is handled by sampling frames uniformly, encoding each with the image encoder, and concatenating the resulting tokens. For a 30-second video at 1 fps, this produces 30 × N_patches tokens — which quickly consumes the context window. Meta addresses this with temporal aggregation: adjacent frames' features are pooled to reduce the token count.
Speech is encoded using a separate audio encoder (trained from scratch) that converts waveforms into discrete token sequences. These are interleaved with text tokens, allowing the model to understand and respond to spoken input.
Each modality connects to the Llama 3 language model through its own adapter pathway. The language model backbone remains shared across all modalities.
| Benchmark | Llama 3 405B | GPT-4V | Gemini 1.5 Pro |
|---|---|---|---|
| MMMU (val) | 64.5 | 63.1 | 62.2 |
| VQAv2 | 79.0 | 77.2 | — |
| ChartQA | 83.2 | 78.5 | 81.3 |
A 405B open-weight model can be used for anything — including harmful purposes. Meta addresses this through multiple layers of safety measures, while acknowledging that open models present fundamentally different safety challenges than API-only models.
Llama Guard is a separate safety classifier (based on Llama 3 8B) that classifies inputs and outputs into safe/unsafe categories. It acts as a guardrail that can be deployed alongside the main model:
Llama Guard classifies across these safety categories:
| Category | Examples |
|---|---|
| S1: Violent crimes | Instructions for weapons, terrorism, assault |
| S2: Non-violent crimes | Fraud, hacking, drug trafficking |
| S3: Sex-related crimes | CSAM, trafficking, non-consensual content |
| S4: Child safety | Content exploiting or endangering minors |
| S5: Defamation | False statements damaging reputation |
| S6: Regulated advice | Unauthorized legal, medical, financial advice |
| S7: Privacy | PII exposure, stalking, doxxing |
| S8: IP violations | Copyright infringement, trademark abuse |
Beyond Llama Guard (which is an external classifier), safety is also baked into the model itself during post-training:
| Method | Mechanism |
|---|---|
| Safety SFT data | ~5K examples of appropriate refusals and safety-aware responses |
| Red teaming | Adversarial attacks by human experts to find failure modes before release |
| Safety DPO | Preference data specifically targeting safety: harmful responses are labeled as "rejected" |
| System prompt | Default instructions that guide the model toward safe behavior |
In deployment, a user query flows through the safety system in two steps: both the input and the output are checked by Llama Guard.
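The input-and-output check can be expressed as a thin wrapper around generation. The sketch below uses stand-in functions for both the classifier and the main model; the names `guarded_generate`, `toy_classify`, and `toy_generate`, the refusal string, and the "safe"/"unsafe" labels are hypothetical and do not reflect Llama Guard's actual interface.

```python
REFUSAL = "Sorry, I can't help with that."


def guarded_generate(prompt, generate, classify):
    """Check the input, generate, then check the output; refuse if either check fails."""
    if classify(prompt) != "safe":
        return REFUSAL
    response = generate(prompt)
    if classify(f"{prompt}\n{response}") != "safe":
        return REFUSAL
    return response


# Stand-ins for the real models (hypothetical, for demonstration only)
def toy_classify(text):
    return "unsafe" if "build a weapon" in text.lower() else "safe"


def toy_generate(prompt):
    return f"Here is an answer to: {prompt}"


safe_reply = guarded_generate("What is grouped query attention?", toy_generate, toy_classify)
blocked_reply = guarded_generate("How do I build a weapon?", toy_generate, toy_classify)
print(safe_reply)
print(blocked_reply)
```

Checking the output separately matters because a harmless-looking prompt can still elicit an unsafe completion, which an input-only filter would miss.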
Let's put it all together by comparing the three Llama 3 models (8B, 70B, 405B) against each other and against GPT-4 across major benchmarks.
| Benchmark | 8B | 70B | 405B | GPT-4 |
|---|---|---|---|---|
| MMLU (5-shot) | 68.4 | 82.0 | 87.3 | 86.4 |
| GSM8K (8-shot CoT) | 79.6 | 95.1 | 96.8 | 92.0 |
| MATH (4-shot) | 30.0 | 50.4 | 73.8 | 52.9 |
| HumanEval | 72.6 | 80.5 | 89.0 | 86.6 |
| ARC-Challenge | 83.4 | 93.0 | 96.9 | 96.4 |
| GPQA | 32.8 | 46.7 | 51.1 | 41.4 |
How efficiently does each model use its parameters? We can measure this as benchmark score per billion parameters:
| Model | MMLU | MMLU/Billion Params | Tokens Seen | FLOPs |
|---|---|---|---|---|
| 8B | 68.4 | 8.55 | 15T | ~$1.8 \times 10^{24}$ |
| 70B | 82.0 | 1.17 | 15T | ~$1.6 \times 10^{25}$ |
| 405B | 87.3 | 0.22 | 15T | ~$3.8 \times 10^{25}$ |
The 8B model is by far the most parameter-efficient, and each additional billion parameters yields diminishing returns. This helps explain the 8B model's popularity for deployment: it reaches roughly 78% of the 405B's MMLU score with about 2% of the parameters, and a correspondingly small fraction of the inference cost.
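The efficiency column is simple division, reproduced here from the MMLU scores and nominal parameter counts in the table. Parameter count is used as a rough proxy for inference cost, which is an assumption of this sketch; real serving cost also depends on hardware, batching, and quantization.

```python
# (parameters in billions, MMLU 5-shot score) taken from the table above
models = {"8B": (8, 68.4), "70B": (70, 82.0), "405B": (405, 87.3)}

per_billion = {name: mmlu / params for name, (params, mmlu) in models.items()}
for name, value in per_billion.items():
    print(f"{name}: {value:.2f} MMLU points per billion parameters")

quality_ratio = models["8B"][1] / models["405B"][1]  # share of the 405B MMLU score
size_ratio = models["8B"][0] / models["405B"][0]     # share of the 405B parameters
print(f"8B reaches {quality_ratio:.0%} of the 405B score with {size_ratio:.1%} of the parameters")
```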
Llama 3 represents the culmination of years of research in scaling, training recipes, and alignment. Let's place it in the broader context.
| Foundation | Contribution |
|---|---|
| Transformer (2017) | The attention-based architecture; Llama 3 uses a decoder-only variant |
| Chinchilla (2022) | Scaling laws that guided compute allocation decisions |
| GQA (Ainslie 2023) | Grouped query attention for efficient KV caching |
| SwiGLU (Shazeer 2020) | Gated FFN activation that consistently improves quality |
| RoPE (Su 2021) | Rotary position embeddings for length generalization |
| DPO (Rafailov 2023) | Direct preference optimization used in post-training |
| RLHF (Ouyang 2022) | The reward model + optimization paradigm for alignment |
| Model Family | Key Feature | Relation to Llama 3 |
|---|---|---|
| Mistral / Mixtral | Mixture of experts (MoE) | Alternative scaling strategy: more params, fewer FLOPs per token |
| Qwen 2 | Strong multilingual | Competitive open model from Alibaba, similar architecture |
| Gemma 2 | Google open weights | Smaller models (2B-27B), similar philosophy |
| Phi-3 | Small model excellence | Microsoft's bet on data quality over model size |
| DeepSeek-V2 | MoE + MLA attention | Innovative attention mechanism alternative to GQA |
The Llama series shows a clear trajectory: each generation gets more data (1.4T → 2T → 15T), larger models (65B → 70B → 405B), and more sophisticated post-training (no RLHF → RLHF → iterative DPO). If this trend continues, Llama 4 will likely feature even more training data (50T+?), longer context (1M+?), native multimodality, and potentially mixture-of-experts architectures for efficiency.
"Our experience suggests that it is possible to train models at the frontier of AI capabilities using a standard, dense Transformer architecture. Neither mixture-of-experts models, nor a novel architecture, are required."
— Llama 3 Technical Report, Meta AI