Meta AI — July 2024

The Llama 3 Herd of Models

Open-weight foundation models at 8B, 70B, and 405B parameters. 15 trillion training tokens, grouped query attention, SwiGLU, RoPE — and a post-training recipe that turns base models into capable assistants.

Prerequisites: Transformer architecture + Self-attention + Training basics. That's it.
10 chapters · 10+ simulations · 0 assumed knowledge

Chapter 0: Why Open Models Matter

In 2024, the most capable language models — GPT-4, Claude 3, Gemini Ultra — are proprietary. You send your data to an API, pay per token, and hope the provider doesn't change the model, raise prices, or discontinue it. You can't inspect the weights, can't fine-tune on your private data, can't run it on your own hardware, and can't verify what it's doing with your inputs.

Meta's Llama 3 challenges this paradigm. By releasing open weights at 8B, 70B, and 405B parameters, Meta made a bet: the best way to advance AI is to let everyone experiment, fine-tune, and build on top of the same foundation. The 405B model is particularly significant — it's the first open model that approaches the performance of GPT-4 and Claude 3.5 on major benchmarks.

Why release a 405B model openly? Meta argues that open models accelerate the entire field: researchers can study scaling behavior, practitioners can fine-tune for specialized domains, and safety researchers can probe for risks. Closed models advance one company; open models advance the ecosystem. Whether this is genuine altruism, competitive strategy, or both — the practical impact is that anyone with sufficient hardware can run a GPT-4-class model locally.
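| Model | Params | Context | Open? | MMLU (5-shot) |
| --- | --- | --- | --- | --- |
| Llama 3 8B | 8B | 128K | Yes | 68.4 |
| Llama 3 70B | 70B | 128K | Yes | 82.0 |
| Llama 3 405B | 405B | 128K | Yes | 87.3 |
| GPT-4 (0125) | ~1.8T (est.) | 128K | No | 86.4 |
| Claude 3.5 Sonnet | Unknown | 200K | No | 88.7 |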

The Llama 3 paper is unusual: it's not just a model release — it's a 92-page technical report that documents the entire pipeline from data curation to post-training to safety evaluation. It's essentially a textbook on how to train a modern large language model at scale. We'll work through the key components.

The Open Model Landscape

Compare Llama 3 against other models. The x-axis shows parameter count (log scale), the y-axis shows MMLU performance. Open models are shown in teal; closed models in orange. Notice how Llama 3 405B closes the gap with proprietary models.

The Llama lineage

| Version | Date | Sizes | Training tokens | Key change |
| --- | --- | --- | --- | --- |
| Llama 1 | Feb 2023 | 7B-65B | 1.4T | First open competitive model |
| Llama 2 | Jul 2023 | 7B-70B | 2T | RLHF, chat-tuned, 4K context |
| Llama 3 | Jul 2024 | 8B-405B | 15T | 128K context, 405B scale, multimodal |

The jump from Llama 2 to Llama 3 is dramatic: 7.5x more training tokens (2T → 15T), architectural refinements (GQA at every model size, a 4x larger vocabulary), a massive scale-up to 405B parameters, and a comprehensive post-training pipeline. Let's understand each piece.

What makes the Llama 3 405B model historically significant?

Chapter 1: Architecture Decisions

Llama 3 uses a standard decoder-only Transformer — the same fundamental architecture as GPT. But every component has been carefully optimized. Meta's philosophy: "Our design philosophy is to use a relatively standard Transformer architecture with minor modifications, rather than developing a bespoke architecture."

Here are the specific architecture choices, each with the engineering rationale:

Grouped Query Attention (GQA)

Standard multi-head attention (MHA) gives each head its own query, key, and value projections. With 128 heads in the 405B model, that's 128 separate key-value pairs per layer — expensive during inference because all KV pairs must be cached.

Grouped Query Attention (Ainslie et al., 2023) shares key-value heads across groups of query heads. Instead of 128 KV heads, Llama 3 uses only 8 KV groups — each group of 16 query heads shares the same keys and values.

MHA: Q_i, K_i, V_i for each head i   (128 KV heads)
GQA: Q_i unique per head; K_g(i), V_g(i) shared within group g(i)   (8 KV heads)
Why GQA? During autoregressive inference, the model must cache all previous key-value pairs (the "KV cache"). With 128 heads at 128-dim each, the cache for one layer is 128 × 128 × 2 (K+V) × sequence_length × bytes_per_element. GQA reduces this by 16x (128/8), making long-context inference feasible. The quality loss vs full MHA is minimal (< 0.5% on benchmarks) because the queries still have full expressiveness — only the keys and values are shared.
python
# Grouped Query Attention: 8 KV heads shared across 128 query heads
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQA(nn.Module):
    def __init__(self, d_model=16384, n_heads=128, n_kv_heads=8):
        super().__init__()
        assert n_heads % n_kv_heads == 0, "query heads must split evenly into KV groups"
        self.head_dim = d_model // n_heads  # 128
        self.n_heads = n_heads              # 128 query heads
        self.n_kv_heads = n_kv_heads        # 8 KV groups
        self.n_rep = n_heads // n_kv_heads  # 16 queries per KV group

        self.wq = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim)
        # Repeat KV heads to match query heads (a KV cache would store k, v before this step)
        k = k.repeat_interleave(self.n_rep, dim=2)  # [B, T, n_heads, head_dim]
        v = v.repeat_interleave(self.n_rep, dim=2)  # [B, T, n_heads, head_dim]
        # Standard causal attention from here (RoPE omitted for brevity)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # [B, n_heads, T, head_dim]
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, self.n_heads * self.head_dim)
        return self.wo(out)
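
The KV-cache saving can be checked with a few lines of arithmetic. The sketch below uses the 405B shape from this chapter (126 layers, head dimension 128) and assumes 2-byte (16-bit) cache entries and a single 128K-token sequence; the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

SEQ_LEN = 128 * 1024  # 131,072 tokens

mha_bytes = kv_cache_bytes(126, 128, 128, SEQ_LEN)  # full multi-head attention
gqa_bytes = kv_cache_bytes(126, 8, 128, SEQ_LEN)    # 8 shared KV heads

print(f"MHA: {mha_bytes / 2**30:.1f} GiB")   # MHA: 1008.0 GiB
print(f"GQA: {gqa_bytes / 2**30:.1f} GiB")   # GQA: 63.0 GiB
print(f"Reduction: {mha_bytes // gqa_bytes}x")  # Reduction: 16x
```

Under these assumptions a single full-context sequence would need about 1 TiB of cache with MHA, versus 63 GiB with GQA.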

SwiGLU Feed-Forward Network

The original Transformer uses a two-layer FFN with ReLU: FFN(x) = ReLU(x W_1) W_2. Llama 3 replaces this with SwiGLU (Shazeer 2020), which uses a gated linear unit with the SiLU (Swish) activation:

FFN_SwiGLU(x) = (SiLU(x W_gate) ⊙ (x W_up)) W_down

Where ⊙ is element-wise multiplication and SiLU(x) = x · σ(x) is the Sigmoid Linear Unit. This introduces a third weight matrix (Wgate) but consistently improves quality. The gating mechanism lets the network learn which dimensions to activate, which is more expressive than a simple ReLU.
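
As a minimal sketch, the SwiGLU formula above can be written in dependency-free Python for a single token vector. The helper names (`matvec`, `swiglu_ffn`) and the tiny weight matrices are illustrative assumptions, not Meta's implementation; a real model would use batched tensor operations.

```python
import math

def silu(z):
    """SiLU(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(x, W_gate)]  # gating branch
    up = matvec(x, W_up)                         # linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise product
    return matvec(hidden, W_down)                # project back to model width

# Toy shapes: d_model = 2, d_ffn = 3
x = [1.0, -2.0]
W_gate = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
W_up = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y)
```

Note the cost of the third matrix: the FFN holds 3 × d_model × d_ffn weights instead of 2 × d_model × d_ffn.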

RoPE (Rotary Position Embeddings)

Instead of learned absolute position embeddings (like BERT) or sinusoidal embeddings (like the original Transformer), Llama 3 uses RoPE (Su et al., 2021). RoPE encodes position by rotating the query and key vectors in 2D subspaces:

RoPE(x, pos) = x ⊙ cos(pos · θ) + rotate(x) ⊙ sin(pos · θ)

Where θ is a vector of per-pair frequencies and rotate() maps each pair of dimensions (x1, x2) to (−x2, x1), a 90° rotation within the pair. The key property: the attention score between positions i and j depends only on their relative distance (i - j), not their absolute positions. This helps with length generalization, although in practice sequences much longer than those seen in training still rely on the RoPE frequency adjustments and the long-context training stage described later.

RoPE's elegance: It achieves relative position encoding through rotation, not addition. Adding position information (like BERT) contaminates the content representation. Rotating it preserves the magnitude of the content vector while encoding position in the angle. And because rotation is linear, it plays nicely with the linear projections in attention.
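
The relative-position property is easy to verify numerically. The sketch below rotates consecutive pairs of dimensions, which is one common pairing convention (implementations differ on which dimensions are paired), and uses the 500K base frequency mentioned in the training recipe; the vectors are made-up examples.

```python
import math

def rope(x, pos, base=500_000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle pos * theta_i."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # per-pair frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[i], x[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
far = dot(rope(q, 105), rope(k, 102))    # shifted by 100, same offset
print(abs(near - far) < 1e-9)            # scores agree up to rounding
```

Because each pair is rotated rather than shifted, the vector norm is unchanged and only the offset between the two positions affects the score.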

Other architecture details

| Component | Choice | Rationale |
| --- | --- | --- |
| Normalization | RMSNorm (pre-norm) | Faster than LayerNorm (no mean computation), pre-norm is more stable for deep networks |
| Vocabulary | 128K tokens (BPE via tiktoken) | Larger vocab = fewer tokens per text = more efficient processing |
| Context length | 128K tokens | Extended from 8K via RoPE frequency scaling in later training stages |
| Attention bias | None (no bias in QKV projections) | Reduces parameters, no quality loss |
| Embedding tying | No (separate input/output embeddings) | Separate embeddings improve quality at 405B scale |
Llama 3 Architecture Diagram

Figure: the Llama 3 Transformer block and its components (GQA attention, SwiGLU FFN, RMSNorm, and RoPE). The 405B model stacks 126 of these blocks.

Model sizes at a glance

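| Parameter | 8B | 70B | 405B |
| --- | --- | --- | --- |
| Layers | 32 | 80 | 126 |
| Hidden dim | 4096 | 8192 | 16384 |
| Query heads | 32 | 64 | 128 |
| KV heads | 8 | 8 | 8 |
| FFN dim | 14336 | 28672 | 53248 |
| Head dim | 128 | 128 | 128 |
| Vocab | 128K | 128K | 128K |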
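
The headline parameter counts can be recovered from this table. The sketch below counts only the large weight matrices (attention projections, the three SwiGLU matrices, and untied input/output embeddings) and ignores the small RMSNorm weights. It assumes an exact vocabulary of 128,256 entries, which the table rounds to 128K.

```python
VOCAB = 128_256  # assumption: exact tokenizer size behind the rounded "128K"

def count_params(layers, d_model, n_heads, n_kv_heads, d_ffn, vocab=VOCAB):
    head_dim = d_model // n_heads
    attn = 2 * d_model * n_heads * head_dim       # W_q and W_o
    attn += 2 * d_model * n_kv_heads * head_dim   # W_k and W_v (GQA: fewer heads)
    ffn = 3 * d_model * d_ffn                     # W_gate, W_up, W_down
    embed = 2 * vocab * d_model                   # untied input + output embeddings
    return layers * (attn + ffn) + embed

configs = {
    "8B": (32, 4096, 32, 8, 14336),
    "70B": (80, 8192, 64, 8, 28672),
    "405B": (126, 16384, 128, 8, 53248),
}
for name, cfg in configs.items():
    print(f"{name}: {count_params(*cfg) / 1e9:.1f}B parameters")
# 8B: 8.0B parameters
# 70B: 70.6B parameters
# 405B: 405.8B parameters
```
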
Why does Llama 3 use Grouped Query Attention (GQA) with only 8 KV heads instead of full multi-head attention with 128 KV heads?

Chapter 2: Pre-training Data Pipeline

Llama 3 was trained on approximately 15 trillion tokens, 7.5x more than Llama 2 (2T) and about 50x more than GPT-3 (roughly 300B). Assuming about 10T unique tokens (a rough estimate), the model saw each token roughly 1.5 times on average. The data pipeline that produces these tokens is arguably as important as the model architecture itself.

Data composition

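| Source | Share | Tokens | Description |
| --- | --- | --- | --- |
| Web crawl | ~85% | ~12.7T | Filtered CommonCrawl: the backbone. Multiple rounds of dedup and quality filtering. |
| Code | ~8% | ~1.2T | GitHub, StackOverflow, code-heavy web pages. 10+ programming languages. |
| Books | ~3% | ~0.45T | Curated book collections, Project Gutenberg, academic texts. |
| Academic | ~2.5% | ~0.38T | arXiv, PubMed, academic web pages, scientific papers. |
| Wikipedia + encyclopedias | ~1.5% | ~0.23T | Wikipedia in 30+ languages, encyclopedic sources. |

These source-level shares are approximate. The technical report itself describes the final mix by content category: roughly 50% general knowledge, 25% mathematical and reasoning data, 17% code, and 8% multilingual tokens.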
Why 15 trillion tokens? Llama 3 follows the Chinchilla scaling laws (Hoffmann et al. 2022), which predict that optimal training uses ~20 tokens per parameter. For 405B parameters, that's 8.1T tokens at the Chinchilla optimal. Meta went 1.85x beyond this because they found that more tokens continue to improve quality, even past the theoretical optimum — especially for smaller models (8B and 70B) where additional tokens compensate for limited model capacity.
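
The arithmetic behind these ratios, as a small sketch. It uses the rounded figures from this chapter (15T tokens, 20 tokens per parameter as the Chinchilla rule of thumb); the function name is illustrative.

```python
TOKENS = 15e12
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb from Hoffmann et al. 2022

def overtraining(params, tokens=TOKENS):
    """Return (tokens per parameter, multiple of the Chinchilla-optimal token count)."""
    per_param = tokens / params
    return per_param, per_param / CHINCHILLA_TOKENS_PER_PARAM

for name, params in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    per_param, multiple = overtraining(params)
    print(f"{name}: {per_param:,.0f} tokens/param, {multiple:.2f}x Chinchilla-optimal")
```

The 405B model lands at about 1.85x the rule-of-thumb budget, while the 8B model is trained on roughly 94x its rule-of-thumb budget.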

Data quality pipeline

Raw CommonCrawl is noisy — full of spam, duplicated content, toxic text, and low-quality pages. Llama 3's data pipeline applies multiple cleaning stages:

Stage 1: URL filtering
Remove known spam domains, adult content, phishing sites. Maintain domain allow/deny lists.
Stage 2: Text extraction
Parse HTML, remove boilerplate (nav bars, footers, ads). Keep only the main content.
Stage 3: Language ID
FastText classifier identifies language. Keep English and 30+ other languages at calibrated ratios.
Stage 4: Quality classifier
Model-based classifiers score each document for quality (trained on Wikipedia + reference documents vs random web); the report describes fastText and RoBERTa-based models for this step. Low-quality documents are removed or downsampled.
Stage 5: Deduplication
MinHash-based near-duplicate removal at document level. Exact n-gram dedup at line level. Removes ~50% of remaining data.
Stage 6: Safety filtering
Remove PII (names, emails, phone numbers), toxic content, and CSAM. Multiple classifiers + heuristics.
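
Stage 5 can be illustrated with a minimal MinHash sketch. This is a toy version under stated assumptions (3-word shingles, 64 seeded BLAKE2b hashes, no LSH banding), not Meta's pipeline; production systems add banding so that candidate pairs can be found without comparing every pair of documents.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping n-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            hashlib.blake2b(item, digest_size=8, salt=salt).digest() for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bend"
doc_c = "training large language models requires careful data filtering and deduplication"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # high: near-duplicates
print(estimated_jaccard(sig_a, sig_c))  # low: unrelated documents
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates, and all but one copy dropped.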
Data Pipeline Funnel

Figure: raw web crawl data is progressively filtered into high-quality training tokens. Each stage removes a portion of the data; the funnel shows approximate data volumes at each stage.

Data mixing and annealing

Not all data is created equal. Meta uses data annealing — changing the data mixture during training:

| Training phase | Data mix | Rationale |
| --- | --- | --- |
| Main training (first ~14T) | 85% web, 8% code, 3% books, ... | Broad knowledge acquisition |
| Annealing (final ~1T) | Upweight high-quality sources | Polish quality on the best data |
| Long-context extension | Extend to 128K with long documents | Teach long-range dependencies |

The annealing phase is particularly clever: in the final stages of training, the learning rate is reduced to near-zero and the data mixture is shifted to emphasize the highest-quality documents. This is analogous to "polishing" a lens — the final refinement that sharpens quality without the risk of large updates destabilizing the model.

Tokenization: 128K BPE vocabulary

Llama 3 quadrupled the vocabulary from 32K (Llama 2) to 128K tokens. This means each piece of text is encoded in fewer tokens, which:

| Benefit | Mechanism | Impact |
| --- | --- | --- |
| Faster inference | Fewer tokens to generate | ~15% fewer tokens for English text |
| More information per step | Each token carries more meaning | Better for code, math, multilingual |
| Better multilingual | Non-English scripts get more vocabulary entries | Significantly fewer tokens for CJK, Arabic, etc. |
Why did Meta train Llama 3 on 15T tokens — nearly 2x the Chinchilla-optimal amount for 405B parameters?

Chapter 3: Training Infrastructure

Training a 405B parameter model on 15T tokens is one of the largest compute operations ever undertaken. Let's understand the engineering required.

Hardware

Llama 3 405B was trained on 16,384 NVIDIA H100 GPUs, organized into clusters connected by high-bandwidth networking:

| Component | Specification |
| --- | --- |
| GPUs | 16,384 × NVIDIA H100 80GB |
| GPU memory | ~1.3 PB total (16K × 80 GB) |
| Interconnect (intra-node) | NVLink 4.0, 900 GB/s per GPU |
| Interconnect (inter-node) | 400 Gbps per GPU (RoCE Ethernet fabric for the 405B run; InfiniBand for the smaller models) |
| Total compute | ~3.8 × 10^25 FLOPs |
| Training time | ~54 days for 405B |
The scale in perspective: 16,384 H100 GPUs running for 54 days consume approximately 30 GWh of electricity, roughly the annual electricity consumption of 3,000 US homes. The hardware alone (not including networking, cooling, or storage) costs approximately $500M at list price. This is why open-weight releases matter: very few organizations can afford to train a model of this scale.
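
The total-compute figure in the hardware table can be sanity-checked with the standard approximation C ≈ 6·N·D (about 6 floating-point operations per parameter per training token, covering the forward and backward pass). This is a rule of thumb, not the report's exact accounting.

```python
def training_flops(params, tokens):
    """Approximate dense-Transformer training compute: C ≈ 6 * N * D."""
    return 6 * params * tokens

flops_rounded = training_flops(405e9, 15e12)    # using the rounded 15T tokens
flops_precise = training_flops(405e9, 15.6e12)  # using the report's 15.6T tokens

print(f"{flops_rounded:.2e}")  # 3.65e+25 (shown as 3.64e+25 or 3.65e+25 depending on rounding)
print(f"{flops_precise:.2e}")
```

With the rounded 15T figure the estimate is within about 5% of the table's ~3.8 × 10^25; the report's more precise 15.6T token count closes most of the gap.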

Parallelism strategy

A 405B model with fp16 weights requires ~810 GB just for the parameters. A single H100 has 80 GB of memory. You can't fit even the weights on one GPU, let alone the optimizer states and activations. The solution: split the model across thousands of GPUs using multiple parallelism strategies simultaneously.
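
A rough memory budget makes the problem concrete. The per-parameter byte counts below are a common accounting for mixed-precision Adam training (16-bit weights and gradients, 32-bit master weights and two optimizer moments); they are an assumption for illustration, not figures from the report, and activations are excluded.

```python
PARAMS = 405 * 10**9
GPU_MEMORY = 80 * 10**9  # bytes per H100

def ceil_div(a, b):
    return -(-a // b)

weights_only = PARAMS * 2                   # 16-bit weights
training_state = PARAMS * (2 + 2 + 4 + 4 + 4)  # weights, grads, master copy, Adam m and v

print(f"Weights: {weights_only / 1e9:.0f} GB "
      f"-> at least {ceil_div(weights_only, GPU_MEMORY)} GPUs")
print(f"Training state: {training_state / 1e12:.2f} TB "
      f"-> at least {ceil_div(training_state, GPU_MEMORY)} GPUs")
# Weights: 810 GB -> at least 11 GPUs
# Training state: 6.48 TB -> at least 81 GPUs
```
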

Tensor Parallelism (TP=8)
Split each layer's weight matrices across 8 GPUs within a node. Each GPU computes a slice of the attention/FFN and they communicate via NVLink (fast, 900 GB/s).
+
Pipeline Parallelism (PP=16)
Split the 126 layers across 16 pipeline stages (~8 layers each). Micro-batch pipelining keeps all stages busy. Stages communicate across nodes over the inter-node network.
+
Data Parallelism (DP=128)
128 replicas of the full pipeline, each processing different data. Gradients are averaged across replicas via all-reduce.
Total GPUs = TP × PP × DP = 8 × 16 × 128 = 16,384
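
One way to picture the layout is to map each global GPU rank to a (tensor, pipeline, data) coordinate. The sketch below assumes tensor parallelism is the innermost dimension, so that each group of 8 consecutive ranks shares a node and its NVLink; the exact rank ordering used in a real job is framework-specific.

```python
TP, PP, DP = 8, 16, 128

def rank_to_coords(rank):
    """Map a global rank to (tp, pp, dp) indices, with TP varying fastest."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return tp, pp, dp

total_gpus = TP * PP * DP
print(total_gpus)             # 16384
print(rank_to_coords(0))      # (0, 0, 0)
print(rank_to_coords(9))      # (1, 1, 0): second GPU of the second pipeline stage
print(rank_to_coords(16383))  # (7, 15, 127)
```

Each model replica spans TP × PP = 128 GPUs, so one replica holds roughly 810 GB / 128 ≈ 6.3 GB of 16-bit weights per GPU before optimizer state and activations.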
3D Parallelism Visualizer

Figure: how 16,384 GPUs are organized. Tensor parallelism splits within a node (8 GPUs). Pipeline parallelism chains nodes into stages. Data parallelism replicates the full pipeline.

Training stability

At this scale, hardware failures are the norm, not the exception. With 16K GPUs running for 54 days:

| Challenge | Frequency | Mitigation |
| --- | --- | --- |
| GPU failure | ~Daily | Automated detection + restart from checkpoint |
| Network issue | Several per day | Redundant paths, graceful reconnection |
| Loss spike | ~3 during 405B training | Roll back to earlier checkpoint, skip problematic data |
| NaN in gradients | Rare but catastrophic | Gradient clipping, loss scaling, checkpoint recovery |

Meta reports that during a 54-day snapshot of the 405B training run, they encountered 466 job interruptions. Of these, 419 were unexpected, and about 78% of the unexpected interruptions were attributed to confirmed or suspected hardware issues. Their automated recovery system was able to restart from the latest checkpoint within minutes, achieving an effective training time above 90%.

Why does training the 405B model require three different parallelism strategies (tensor, pipeline, and data parallelism) simultaneously?

Chapter 4: Scaling Laws & Training Recipe

How do you decide whether to train an 8B, 70B, or 405B model? How many tokens should each see? How do you set the learning rate and batch size? Meta used scaling laws — empirical relationships between model size, data size, and compute — to answer these questions before committing to the full training runs.

The scaling law methodology

Meta trained hundreds of small models (40M to 16B parameters) on varying amounts of data and measured their validation loss. They then fit power-law curves to predict the performance of larger models:

L(N, D) = A / N^α + B / D^β + L_∞

Where N is the number of parameters, D is the number of training tokens, A, B, α, β are fitted constants, and L_∞ is the irreducible loss (the best possible loss with infinite data and model size). This equation has three terms:

| Term | Meaning |
| --- | --- |
| A / N^α | Loss reduction from model size: bigger models memorize more patterns |
| B / D^β | Loss reduction from data: more data prevents overfitting |
| L_∞ | Irreducible loss: the entropy of natural language itself |
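
To get a feel for the three-term form, the sketch below evaluates it with the constants published in the Chinchilla paper (Hoffmann et al. 2022). These are not Meta's fitted values, which this section does not list, so the absolute numbers are illustrative only; the point is how the model-size and data terms trade off.

```python
def scaling_law_loss(n_params, n_tokens,
                     A=406.4, B=410.7, alpha=0.34, beta=0.28, irreducible=1.69):
    """Parametric loss: A / N^alpha + B / D^beta + irreducible term (Chinchilla fit)."""
    return A / n_params**alpha + B / n_tokens**beta + irreducible

for name, n in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    print(f"{name} @ 15T tokens: predicted loss {scaling_law_loss(n, 15e12):.3f}")

# Same 8B model at the Chinchilla-optimal 160B tokens vs the 15T actually used
print(f"8B @ 160B tokens: {scaling_law_loss(8e9, 160e9):.3f}")
```

Holding the model at 8B and raising the token count from 160B to 15T lowers the data term only, which is the "overtraining" trade discussed in this chapter.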
A critical finding: Meta's scaling experiments showed that the Chinchilla-optimal compute allocation (roughly equal investment in model size and data) is not optimal when inference cost matters. Since you only train once but serve the model millions of times, it's better to "overtrain" a smaller model on more data — getting a model that's cheaper to serve at similar quality. This is why the 8B model was trained on 15T tokens (1875 tokens per parameter) rather than the Chinchilla-optimal ~160B tokens (20 tokens per parameter).

Training recipe

Phase 1: Initial pre-training
Standard training on ~14T tokens with 8K context. Batch size ramps from small to 16M tokens. Learning rate warms up then follows cosine decay.
Phase 2: Long-context extension
Gradually extend context from 8K → 128K tokens over ~800B tokens. RoPE base frequency increased from 500K → 8M to support longer sequences.
Phase 3: Annealing
Final ~1T tokens at near-zero learning rate. Upweight highest-quality data. This "polishing" phase yields 1-2% improvement on benchmarks.
Scaling Law Predictions

Figure: validation loss versus model size and training tokens. The curves show power-law scaling: each doubling of model size or data gives diminishing but consistent returns.

Learning rate schedule

| Phase | Learning rate | Schedule |
| --- | --- | --- |
| Warmup | 0 → 8×10^-5 | Linear, 8000 steps |
| Main | 8×10^-5 → 8×10^-7 | Cosine decay over ~15T tokens |
| Annealing | 8×10^-7 → 0 | Linear decay over final ~1T tokens |
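
The three phases in this table can be sketched as a single function of the step count. The peak and final learning rates and the 8000 warmup steps come from the table; the total and annealing step counts are hypothetical defaults chosen for illustration, since the table gives those phases in tokens rather than steps.

```python
import math

PEAK_LR, FINAL_LR, WARMUP_STEPS = 8e-5, 8e-7, 8000

def learning_rate(step, total_steps=1_200_000, anneal_steps=80_000):
    """Linear warmup, cosine decay to FINAL_LR, then linear anneal to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step < total_steps:
        progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
        return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))
    anneal_progress = min(1.0, (step - total_steps) / anneal_steps)
    return FINAL_LR * (1.0 - anneal_progress)

for step in (0, 4000, 8000, 600_000, 1_200_000, 1_280_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```
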
Why did Meta "overtrain" the 8B model on 15T tokens (far more than the Chinchilla-optimal ~160B tokens)?

Chapter 5: Post-Training

Pre-training produces a powerful but "raw" model — it can complete text fluently but doesn't know how to follow instructions, refuse harmful requests, or engage in helpful dialogue. Post-training transforms this base model into an aligned assistant through multiple stages of fine-tuning.

The post-training pipeline

Stage 1: Supervised Fine-Tuning (SFT)
Train on ~27K high-quality human-written (prompt, response) pairs. Teaches the model the format of helpful, harmless responses.
Stage 2: Reward Model
Train a separate model to score responses by quality. Uses ~1M human preference comparisons (response A vs response B for same prompt).
Stage 3: DPO / Rejection Sampling
Optimize the SFT model using the reward model's signal. Multiple rounds, each using the latest model to generate new samples.
↻ Repeat stages 1-3 iteratively (6 rounds total)

Stage 1: SFT — Learning to follow instructions

SFT uses carefully curated prompt-response pairs. Meta emphasizes quality over quantity: they found that 27,540 high-quality examples outperformed 10x more lower-quality examples. The SFT data covers:

| Category | Examples | Purpose |
| --- | --- | --- |
| General helpfulness | ~10K | Answer questions, follow instructions, explain concepts |
| Safety | ~5K | Refuse harmful requests, provide safety context |
| Code | ~5K | Write, debug, explain code |
| Math/reasoning | ~4K | Step-by-step problem solving |
| Multilingual | ~3K | Follow instructions in non-English languages |

Stage 2: Reward model

The reward model is initialized from the pre-trained base model and trained to predict which of two responses a human annotator would prefer. Given a prompt x and two responses y_w (preferred) and y_l (rejected):

L_RM = -log σ(r(x, y_w) - r(x, y_l))

Where r(x, y) is the scalar reward score and σ is the sigmoid function. This is the Bradley-Terry model of pairwise preferences, the same family of model that underlies chess Elo ratings.
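
As a minimal sketch, the pairwise loss can be computed for a single comparison from two scalar rewards. It uses the identity -log σ(m) = log(1 + e^(-m)), written in a numerically stable form; the function names are illustrative.

```python
import math

def softplus(z):
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return softplus(-(r_chosen - r_rejected))

print(reward_model_loss(2.0, 0.0))   # small: the preferred response already scores higher
print(reward_model_loss(0.0, 0.0))   # log(2): the model cannot tell the two apart
print(reward_model_loss(0.0, 2.0))   # large: the ranking is inverted
```

Only the difference between the two scores matters, so the reward scale has no absolute meaning.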

Stage 3: DPO — Direct Preference Optimization

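Llama 3 still trains the reward model from Stage 2, but uses it to rank candidates for rejection sampling rather than as the objective for PPO (as in classic RLHF). For preference optimization itself, Llama 3 primarily uses DPO (Rafailov et al., 2023), which directly optimizes the policy on preference data:

L_DPO = -log σ(β [log(π_θ(y_w|x) / π_ref(y_w|x)) - log(π_θ(y_l|x) / π_ref(y_l|x))])

Here π_θ is the policy being trained, π_ref is a frozen reference model, and β controls how strongly the policy is held close to the reference.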

DPO avoids the instability of PPO training and is simpler to implement. The key idea: instead of fitting a reward model and then optimizing against it, DPO uses the language model itself as an implicit reward model.
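
The DPO objective for one preference pair needs only four sequence log-probabilities. The sketch below assumes those have already been computed (summed over response tokens) and uses β = 0.1 as an illustrative default; the argument names are hypothetical.

```python
import math

def softplus(z):
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref for y_l
    return softplus(-beta * (chosen_logratio - rejected_logratio))

# Policy identical to the reference: loss is log(2)
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has raised the chosen response and lowered the rejected one: loss drops
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The quantity β · log(π_θ / π_ref) plays the role of the implicit reward, which is why no separate reward model appears in the loss.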

The iterative recipe is the secret weapon. Meta runs 6 rounds of the full post-training pipeline. In each round, the latest model generates new responses, humans rate them, and the model is further refined. Each round improves on the previous by generating higher-quality training data from a better starting model. This is qualitatively different from running DPO once — it's a feedback loop where the model bootstraps its own improvement.
Post-Training Pipeline

Figure: the iterative post-training process. Each round consists of SFT, reward model training, and DPO. The model quality improves with each round as better models generate better training data.

Rejection sampling

In addition to DPO, Meta uses rejection sampling: generate K responses from the current model, score each with the reward model, and keep only the best one for SFT. This is simpler than DPO and provides high-quality training examples for the next round of SFT.

python
# Rejection sampling: generate K responses, keep the best
def rejection_sample(model, reward_model, prompt, K=16):
    responses = [model.generate(prompt, temperature=0.7) for _ in range(K)]
    scores = [reward_model.score(prompt, r) for r in responses]
    best_idx = max(range(K), key=lambda i: scores[i])
    return responses[best_idx]  # highest-scoring response

# This is compute-intensive but produces very high quality
# training examples for the next SFT round
Why does Meta run 6 rounds of the post-training pipeline instead of just one round of SFT + DPO?

Chapter 6: Multimodal Extensions

The Llama 3 paper describes experiments extending the language model to handle images, video, and speech — turning it into a multimodal model. These extensions follow a common pattern: keep the pre-trained language model mostly frozen and train adapter modules that project other modalities into the language model's embedding space.

Image understanding

The vision extension uses a pre-trained image encoder (ViT-H) connected to the language model via learned cross-attention adapter layers:

Image Encoder (ViT-H/14)
Pre-trained, frozen. Extracts patch embeddings from the image. Output: [N_patches, D_vision]
Projection Layer
Cross-attention adapter that maps vision features into the language model's embedding space. Trained during multimodal fine-tuning.
Llama 3 Language Model
Processes interleaved text and projected image tokens. Generates text responses.

The cross-attention adapter is key: instead of simply projecting image features through an MLP (as in LLaVA), Llama 3 uses cross-attention layers inserted between Transformer blocks. The image tokens serve as keys and values, while the text tokens serve as queries. This gives the language model fine-grained access to spatial information in the image.

Video understanding

Video is handled by sampling frames uniformly, encoding each with the image encoder, and concatenating the resulting tokens. For a 30-second video at 1 fps, this produces 30 × N_patches tokens — which quickly consumes the context window. Meta addresses this with temporal aggregation: adjacent frames' features are pooled to reduce the token count.
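
A quick token-budget calculation shows why pooling matters. The sketch assumes 256 patches per frame (a 224×224 frame cut into 14×14-pixel patches gives 16 × 16) and a hypothetical pooling factor of 4 adjacent frames; neither number is taken from this section.

```python
def video_tokens(duration_s, fps, patches_per_frame, pool=1):
    """Visual tokens after sampling frames and pooling groups of `pool` adjacent frames."""
    frames = int(duration_s * fps)
    pooled_frames = -(-frames // pool)  # ceiling division
    return pooled_frames * patches_per_frame

N_PATCHES = 256  # assumption: (224 / 14) ** 2

print(video_tokens(30, 1, N_PATCHES))          # 7680 tokens without pooling
print(video_tokens(30, 1, N_PATCHES, pool=4))  # 2048 tokens with 4-frame pooling
```
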

Speech understanding

Speech is encoded using a separate audio encoder (trained from scratch) that converts waveforms into discrete token sequences. These are interleaved with text tokens, allowing the model to understand and respond to spoken input.

Multimodal Architecture

Figure: how different modalities connect to the Llama 3 language model through adapters, each with its own pathway. The language model backbone remains shared across all modalities.

The modular approach: By keeping the language model frozen and only training adapters, Meta avoids catastrophic forgetting — the model retains its full language capabilities while gaining multimodal understanding. This is also more compute-efficient than training a multimodal model from scratch.

Multimodal results

| Benchmark | Llama 3 405B | GPT-4V | Gemini 1.5 Pro |
| --- | --- | --- | --- |
| MMMU (val) | 64.5 | 63.1 | 62.2 |
| VQAv2 | 79.0 | 77.2 | – |
| ChartQA | 83.2 | 78.5 | 81.3 |
How does Llama 3 add image understanding without forgetting its language capabilities?

Chapter 7: Safety

A 405B open-weight model can be used for anything — including harmful purposes. Meta addresses this through multiple layers of safety measures, while acknowledging that open models present fundamentally different safety challenges than API-only models.

Llama Guard

Llama Guard is a separate safety classifier (based on Llama 3 8B) that classifies inputs and outputs into safe/unsafe categories. It acts as a guardrail that can be deployed alongside the main model:

User input
The raw prompt from the user
Llama Guard (input)
Classifies: is this prompt safe to respond to? Categories: violence, sexual content, CBRN, self-harm, ...
Llama 3 (main model)
Generates response (only if input is classified as safe)
Llama Guard (output)
Classifies: is this response safe to show? Catches harmful outputs even from safe prompts.
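
The flow above amounts to a two-sided filter around the main model. The sketch below shows the control flow only: `generate` and `is_unsafe` are hypothetical callables standing in for Llama 3 and Llama Guard, and the keyword check is a toy placeholder, not how the real classifier works.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."

def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     is_unsafe: Callable[[str], bool]) -> str:
    """Check the prompt, generate only if it is safe, then check the response."""
    if is_unsafe(prompt):        # input guard
        return REFUSAL
    response = generate(prompt)  # main model
    if is_unsafe(response):      # output guard
        return REFUSAL
    return response

# Toy stand-ins for illustration only
def toy_is_unsafe(text: str) -> bool:
    return "attack" in text.lower()

def toy_model(prompt: str) -> str:
    return f"Here is an answer to: {prompt}"

print(guarded_generate("How do plants grow?", toy_model, toy_is_unsafe))
print(guarded_generate("Plan an attack", toy_model, toy_is_unsafe))
```

In a real deployment the output check would typically see the prompt together with the response, since whether a reply is harmful can depend on what was asked.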

Llama Guard classifies across these safety categories:

| Category | Examples |
| --- | --- |
| S1: Violent crimes | Instructions for weapons, terrorism, assault |
| S2: Non-violent crimes | Fraud, hacking, drug trafficking |
| S3: Sex-related crimes | CSAM, trafficking, non-consensual content |
| S4: Child safety | Content exploiting or endangering minors |
| S5: Defamation | False statements damaging reputation |
| S6: Regulated advice | Unauthorized legal, medical, financial advice |
| S7: Privacy | PII exposure, stalking, doxxing |
| S8: IP violations | Copyright infringement, trademark abuse |
The open model safety dilemma: With a closed model (API), the provider can enforce safety at the server. With an open model, anyone can remove the safety training (fine-tune the model on harmful data) or bypass Llama Guard (just don't use it). Meta's position is that the benefits of open models (transparency, research access, democratization) outweigh the incremental risk, since the underlying capabilities (chemistry, biology, etc.) are already freely available in textbooks and on the internet.

Safety training in post-training

Beyond Llama Guard (which is an external classifier), safety is also baked into the model itself during post-training:

| Method | Mechanism |
| --- | --- |
| Safety SFT data | ~5K examples of appropriate refusals and safety-aware responses |
| Red teaming | Adversarial attacks by human experts to find failure modes before release |
| Safety DPO | Preference data specifically targeting safety: harmful responses are labeled as "rejected" |
| System prompt | Default instructions that guide the model toward safe behavior |
Safety Pipeline Visualizer

Figure: how a user query flows through the safety system. Both input and output are checked by Llama Guard, so safe and unsafe queries take different paths.
How does Llama Guard differ from the safety training built into the model during post-training?

Chapter 8: Model Explorer

Let's put it all together. This interactive explorer lets you compare all three Llama 3 model sizes across benchmarks, explore the architecture at different scales, and see how the training pipeline transforms a base model into an aligned assistant.

Llama 3 Benchmark Dashboard

Figure: Llama 3 models (8B, 70B, 405B) compared against each other and against GPT-4 across major benchmarks.

Benchmark results

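| Benchmark | 8B | 70B | 405B | GPT-4 |
| --- | --- | --- | --- | --- |
| MMLU (5-shot) | 68.4 | 82.0 | 87.3 | 86.4 |
| GSM8K (8-shot CoT) | 79.6 | 95.1 | 96.8 | 92.0 |
| MATH (4-shot) | 30.0 | 50.4 | 73.8 | 52.9 |
| HumanEval | 72.6 | 80.5 | 89.0 | 86.6 |
| ARC-Challenge | 83.4 | 93.0 | 96.9 | 96.4 |
| GPQA | 32.8 | 46.7 | 51.1 | 41.4 |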
Where Llama 3 405B beats GPT-4: every benchmark in this table. MMLU (87.3 vs 86.4), GSM8K (96.8 vs 92.0), MATH (73.8 vs 52.9), HumanEval (89.0 vs 86.6), ARC-Challenge (96.9 vs 96.4), GPQA (51.1 vs 41.4). The math results are particularly notable: on MATH, Llama 3 405B leads GPT-4 by 20.9 points. Where it falls behind: instruction following, creative writing, and complex multi-turn dialogue.

Parameter efficiency across scales

How efficiently does each model use its parameters? We can measure this as benchmark score per billion parameters:

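| Model | MMLU | MMLU / billion params | Tokens seen | Training FLOPs (≈ 6ND) |
| --- | --- | --- | --- | --- |
| 8B | 68.4 | 8.55 | 15T | ~7.2×10^23 |
| 70B | 82.0 | 1.17 | 15T | ~6.3×10^24 |
| 405B | 87.3 | 0.22 | 15T | ~3.8×10^25 |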

The 8B model is by far the most parameter-efficient — each additional billion parameters yields diminishing returns. This is why the 8B model is the most popular for deployment: it offers 80% of the 405B's quality at 2% of the inference cost.

On which benchmark category does Llama 3 405B most significantly outperform GPT-4, and by how much?

Chapter 9: Connections

Llama 3 represents the culmination of years of research in scaling, training recipes, and alignment. Let's place it in the broader context.

What Llama 3 builds on

| Foundation | Contribution |
| --- | --- |
| Transformer (2017) | The attention-based architecture whose decoder-only variant Llama 3 uses |
| Chinchilla (2022) | Scaling laws that guided compute allocation decisions |
| GQA (Ainslie 2023) | Grouped query attention for efficient KV caching |
| SwiGLU (Shazeer 2020) | Gated FFN activation that consistently improves quality |
| RoPE (Su 2021) | Rotary position embeddings for length generalization |
| DPO (Rafailov 2023) | Direct preference optimization used in post-training |
| RLHF (Ouyang 2022) | The reward model + optimization paradigm for alignment |

The open model ecosystem

| Model family | Key feature | Relation to Llama 3 |
| --- | --- | --- |
| Mistral / Mixtral | Mixture of experts (MoE) | Alternative scaling strategy: more params, fewer FLOPs per token |
| Qwen 2 | Strong multilingual | Competitive open model from Alibaba, similar architecture |
| Gemma 2 | Google open weights | Smaller models (2B-27B), similar philosophy |
| Phi-3 | Small model excellence | Microsoft's bet on data quality over model size |
| DeepSeek-V2 | MoE + MLA attention | Innovative attention mechanism alternative to GQA |

Key innovations and their impact

Llama 3's lasting contributions:
1. Scaling data matters as much as scaling models. 15T tokens on 8B params produces better results than 1.5T on 70B for many tasks.
2. Iterative post-training is crucial. 6 rounds of SFT→RM→DPO, each bootstrapping from the improved model.
3. Standard architectures work. No exotic modifications needed — GQA, SwiGLU, and RoPE are all established techniques combined carefully.
4. Open models can match closed ones. 405B approaches GPT-4 on most benchmarks, proving that the "secret sauce" is primarily scale and data quality, not proprietary architectural innovations.

The future trajectory

The Llama series shows a clear trajectory: each generation gets more data (1.4T → 2T → 15T), larger models (65B → 70B → 405B), and more sophisticated post-training (no RLHF → RLHF → iterative DPO). If this trend continues, Llama 4 will likely feature even more training data (50T+?), longer context (1M+?), native multimodality, and potentially mixture-of-experts architectures for efficiency.

"Our experience suggests that it is possible to train models at the frontier of AI capabilities using a standard, dense Transformer architecture. Neither mixture-of-experts models, nor a novel architecture, are required."
— Llama 3 Technical Report, Meta AI

What is the most important lesson from the Llama 1→2→3 progression for the field of AI?