Llama 3 (Meta 2024)

Chapter 0: Why Open Models Matter

In 2024, the most capable language models — GPT-4, Claude 3, Gemini Ultra — are proprietary. You send your data to an API, pay per token, and hope the provider doesn't change the model, raise prices, or discontinue it. You can't inspect the weights, can't fine-tune on your private data, can't run it on your own hardware, and can't verify what it's doing with your inputs.

Meta's Llama 3 challenges this paradigm. By releasing open weights at 8B, 70B, and 405B parameters, Meta made a bet: the best way to advance AI is to let everyone experiment, fine-tune, and build on top of the same foundation. The 405B model is particularly significant — it's the first open model that approaches the performance of GPT-4 and Claude 3.5 on major benchmarks.

Why release a 405B model openly? Meta argues that open models accelerate the entire field: researchers can study scaling behavior, practitioners can fine-tune for specialized domains, and safety researchers can probe for risks. Closed models advance one company; open models advance the ecosystem. Whether this is genuine altruism, competitive strategy, or both — the practical impact is that anyone with sufficient hardware can run a GPT-4-class model locally.

Model	Params	Context	Open?	MMLU (5-shot)
Llama 3 8B	8B	128K	Yes	68.4
Llama 3 70B	70B	128K	Yes	82.0
Llama 3 405B	405B	128K	Yes	87.3
GPT-4 (0125)	~1.8T (est.)	128K	No	86.4
Claude 3.5 Sonnet	Unknown	200K	No	88.7

The Llama 3 paper is unusual: it's not just a model release — it's a 92-page technical report that documents the entire pipeline from data curation to post-training to safety evaluation. It's essentially a textbook on how to train a modern large language model at scale. We'll work through the key components.

The Open Model Landscape

Compare Llama 3 against other models. The x-axis shows parameter count (log scale), the y-axis shows MMLU performance. Open models are shown in teal; closed models in orange. Notice how Llama 3 405B closes the gap with proprietary models.

The Llama lineage

Version	Date	Sizes	Training tokens	Key change
Llama 1	Feb 2023	7B-65B	1.4T	First open competitive model
Llama 2	Jul 2023	7B-70B	2T	RLHF, chat-tuned, 4K context
Llama 3	Jul 2024	8B-405B	15T	128K context, 405B scale, multimodal

The jump from Llama 2 to Llama 3 is dramatic: 7.5x more training tokens (2T → 15T), architectural refinements (GQA at every model size, a 128K-token vocabulary), a scale-up to 405B parameters, and a comprehensive post-training pipeline. Let's understand each piece.

What makes the Llama 3 405B model historically significant?

It is the first open-weight model to approach the performance of proprietary frontier models like GPT-4 on major benchmarks while being fully downloadable and fine-tunable, demonstrating that open models can compete with closed ones at sufficient scale.

It is the largest language model ever trained.

It is the first model to use the Transformer architecture.

Chapter 1: Architecture Decisions

Llama 3 uses a standard decoder-only Transformer — the same fundamental architecture as GPT. But every component has been carefully optimized. Meta's philosophy: "Our design philosophy is to use a relatively standard Transformer architecture with minor modifications, rather than developing a bespoke architecture."

Here are the specific architecture choices, each with the engineering rationale:

Grouped Query Attention (GQA)

Standard multi-head attention (MHA) gives each head its own query, key, and value projections. With 128 heads in the 405B model, that's 128 separate key-value pairs per layer — expensive during inference because all KV pairs must be cached.

Grouped Query Attention (Ainslie et al., 2023) shares key-value heads across groups of query heads. Instead of 128 KV heads, Llama 3 uses only 8 KV groups — each group of 16 query heads shares the same keys and values.

MHA: Q_i, K_i, V_i for each head i (128 KV heads)

GQA: Q_i unique, K_g(i), V_g(i) shared within group (8 KV heads)

Why GQA? During autoregressive inference, the model must cache all previous key-value pairs (the "KV cache"). With 128 heads at 128-dim each, the cache for one layer is 128 × 128 × 2 (K+V) × sequence_length × bytes_per_element. GQA reduces this by 16x (128/8), making long-context inference feasible. The quality loss vs full MHA is minimal (< 0.5% on benchmarks) because the queries still have full expressiveness — only the keys and values are shared.
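The cache arithmetic above can be checked in a few lines. This is a rough sketch that assumes 2-byte (16-bit) cache entries, a single sequence of 128,000 tokens, and the 126-layer 405B configuration; the helper name `kv_cache_bytes` is illustrative, not part of any library.

```python
def kv_cache_bytes(n_kv_heads, head_dim=128, seq_len=128_000,
                   n_layers=126, bytes_per_element=2):
    """Bytes needed to cache keys and values for one sequence."""
    per_token_per_layer = n_kv_heads * head_dim * 2  # 2 = one K and one V
    return per_token_per_layer * seq_len * n_layers * bytes_per_element

mha = kv_cache_bytes(n_kv_heads=128)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8)    # 8 shared KV heads, as in Llama 3

print(f"MHA cache: {mha / 1e9:.0f} GB")
print(f"GQA cache: {gqa / 1e9:.0f} GB")
print(f"Reduction: {mha // gqa}x")
```

Under these assumptions a full-MHA cache for one long sequence would be on the order of a terabyte, while the GQA cache stays in the tens of gigabytes.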

python
# Grouped Query Attention — 8 KV heads shared across 128 query heads
import torch
import torch.nn as nn

class GQA(nn.Module):
    def __init__(self, d_model=16384, n_heads=128, n_kv_heads=8):
        super().__init__()
        self.head_dim = d_model // n_heads  # 128
        self.n_heads = n_heads              # 128 query heads
        self.n_kv_heads = n_kv_heads        # 8 KV groups
        self.n_rep = n_heads // n_kv_heads  # 16 queries per KV group

        self.wq = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim)
        # Repeat KV heads to match query heads
        k = k.repeat_interleave(self.n_rep, dim=2)  # [B,T,128,128]
        v = v.repeat_interleave(self.n_rep, dim=2)  # [B,T,128,128]
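        # Move heads before the sequence axis: [B, n_heads, T, head_dim]
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))
        # Causal scaled dot-product attention (RoPE omitted for brevity)
        out = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Merge heads and apply the output projection
        out = out.transpose(1, 2).reshape(B, T, self.n_heads * self.head_dim)
        return self.wo(out)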

SwiGLU Feed-Forward Network

The original Transformer uses a two-layer FFN with ReLU: FFN(x) = ReLU(xW₁)W₂. Llama 3 replaces this with SwiGLU (Shazeer 2020), which uses a gated linear unit with the SiLU (Swish) activation:

FFN_SwiGLU(x) = (SiLU(x W_gate) ⊙ x W_up) W_down

Where ⊙ is element-wise multiplication and SiLU(x) = x · σ(x) is the Sigmoid Linear Unit. This introduces a third weight matrix (W_gate) but consistently improves quality. The gating mechanism lets the network learn which dimensions to activate, which is more expressive than a simple ReLU.
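To make the gating concrete, here is a minimal pure-Python sketch of the SwiGLU formula above on toy dimensions. The weight values are arbitrary illustrative numbers and the helper names (`silu`, `matvec`, `swiglu_ffn`) are assumptions for this example; a real implementation would use a tensor library.

```python
import math

def silu(z):
    """SiLU / Swish: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(x, W_gate)]  # SiLU(x W_gate)
    up = matvec(x, W_up)                         # x W_up
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise product
    return matvec(hidden, W_down)                # project back to d_model

# Toy sizes: d_model = 2, hidden = 3
x = [1.0, -2.0]
W_gate = [[0.5, -1.0, 0.0], [0.25, 0.5, 1.0]]
W_up = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y)
```

When a gate pre-activation is strongly negative, SiLU pushes it toward zero and the corresponding hidden unit contributes almost nothing, which is the "learn which dimensions to activate" behaviour described above.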

RoPE (Rotary Position Embeddings)

Instead of learned absolute position embeddings (like BERT) or sinusoidal embeddings (like the original Transformer), Llama 3 uses RoPE (Su et al., 2021). RoPE encodes position by rotating the query and key vectors in 2D subspaces:

RoPE(x, pos) = x · cos(pos · θ) + rotate(x) · sin(pos · θ)

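Where θ is a vector of per-pair frequencies and rotate() maps each pair of dimensions (x₁, x₂) to (−x₂, x₁), so every 2D subspace is rotated by an angle proportional to the position. The key property: the attention score between positions i and j depends only on their relative distance (i - j), not their absolute positions. This relative structure is what makes it practical to extend the context window later by rescaling the RoPE frequencies, although a model still needs some training at the longer lengths to use them well.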

RoPE's elegance: It achieves relative position encoding through rotation, not addition. Adding position information (like BERT) contaminates the content representation. Rotating it preserves the magnitude of the content vector while encoding position in the angle. And because rotation is linear, it plays nicely with the linear projections in attention.
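The relative-distance property can be verified numerically. The sketch below rotates consecutive pairs of dimensions (one common pairing convention, assumed here) with a base of 500,000, and checks that the query-key dot product is the same for two position pairs with the same offset. The vectors are arbitrary illustrative values.

```python
import math

def rope(x, pos, base=500_000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle pos * theta_i."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # frequency of this 2D subspace
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [x[i] * c - x[i + 1] * s,  # standard 2D rotation
                x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3) at two different absolute positions
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)
```

The two scores agree up to floating-point error, and the rotated vectors keep the same norm as the originals, matching the "rotation, not addition" argument above.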

Other architecture details

Component	Choice	Rationale
Normalization	RMSNorm (pre-norm)	Faster than LayerNorm (no mean computation), pre-norm is more stable for deep networks
Vocabulary	128K tokens (BPE via tiktoken)	Larger vocab = fewer tokens per text = more efficient processing
Context length	128K tokens	Extended from 8K via RoPE frequency scaling in later training stages
Attention bias	None (no bias in QKV projections)	Reduces parameters, no quality loss
Embedding tying	No (separate input/output embeddings)	Separate embeddings improve quality at 405B scale
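As an illustration of the normalization row in the table, here is a minimal pure-Python RMSNorm. It is a sketch under assumed defaults (the `eps` value and the function name are illustrative); note that, unlike LayerNorm, no mean is subtracted.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

x = [2.0, -4.0, 4.0, 0.0]
y = rms_norm(x, weight=[1.0] * 4)
print(y)
```

With unit gains the output has a root mean square of approximately 1, whatever the scale of the input.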

Llama 3 Architecture Diagram

The Llama 3 Transformer block combines four components: GQA attention, a SwiGLU FFN, RMSNorm, and RoPE. The 405B model stacks 126 of these blocks.

Model sizes at a glance

Parameter	8B	70B	405B
Layers	32	80	126
Hidden dim	4096	8192	16384
Query heads	32	64	128
KV heads	8	8	8
FFN dim	14336	28672	53248
Head dim	128	128	128
Vocab	128K	128K	128K
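The table is enough to recover the headline parameter counts. The sketch below assumes a vocabulary of exactly 128,256 entries (the table only says "128K"), untied input and output embeddings, no biases, and two RMSNorm gain vectors per block plus a final one; `llama_params` is an illustrative helper, not an official formula.

```python
def llama_params(layers, d_model, n_heads, n_kv_heads, ffn_dim,
                 vocab=128_256, head_dim=128):
    attn = (d_model * n_heads * head_dim           # W_q
            + 2 * d_model * n_kv_heads * head_dim  # W_k and W_v (shared KV heads)
            + n_heads * head_dim * d_model)        # W_o
    ffn = 3 * d_model * ffn_dim                    # W_gate, W_up, W_down
    norms = 2 * d_model                            # two RMSNorm gains per block
    embeddings = 2 * vocab * d_model               # untied input and output
    return layers * (attn + ffn + norms) + embeddings + d_model  # + final norm

configs = {
    "8B":   dict(layers=32,  d_model=4096,  n_heads=32,  n_kv_heads=8, ffn_dim=14336),
    "70B":  dict(layers=80,  d_model=8192,  n_heads=64,  n_kv_heads=8, ffn_dim=28672),
    "405B": dict(layers=126, d_model=16384, n_heads=128, n_kv_heads=8, ffn_dim=53248),
}
totals = {name: llama_params(**cfg) for name, cfg in configs.items()}
for name, n in totals.items():
    print(f"{name}: {n / 1e9:.2f}B parameters")
```

Most of the parameters sit in the three FFN matrices; the shared KV projections are small by comparison, which is why GQA changes the cache size far more than the parameter count.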

Why does Llama 3 use Grouped Query Attention (GQA) with only 8 KV heads instead of full multi-head attention with 128 KV heads?

GQA reduces the KV cache memory by 16x during inference (128→8 KV heads), making long-context generation feasible, while the quality loss is minimal (<0.5%) because query heads retain full expressiveness: only the keys and values are shared within groups.

GQA is faster to train because it uses fewer parameters.

GQA produces better attention patterns than standard multi-head attention.

Chapter 2: Pre-training Data Pipeline

Llama 3 was trained on approximately 15 trillion tokens: 7.5x more than Llama 2 (2T) and roughly 50x more than GPT-3 (about 300B). If the deduplicated corpus holds around 10T unique tokens, the model saw each token roughly 1.5 times on average (15T tokens / ~10T unique tokens). The data pipeline that produces these tokens is arguably as important as the model architecture itself.

Data composition

Source	Share	Tokens	Description
Web crawl	~85%	~12.7T	Filtered CommonCrawl — the backbone. Multiple rounds of dedup and quality filtering.
Code	~8%	~1.2T	GitHub, StackOverflow, code-heavy web pages. 10+ programming languages.
Books	~3%	~0.45T	Curated book collections, Project Gutenberg, academic texts.
Academic	~2.5%	~0.38T	arXiv, PubMed, academic web pages, scientific papers.
Wikipedia + encyclopedias	~1.5%	~0.23T	Wikipedia in 30+ languages, encyclopedic sources.

Why 15 trillion tokens? Llama 3 follows the Chinchilla scaling laws (Hoffmann et al. 2022), which predict that optimal training uses ~20 tokens per parameter. For 405B parameters, that's 8.1T tokens at the Chinchilla optimal. Meta went 1.85x beyond this because they found that more tokens continue to improve quality, even past the theoretical optimum — especially for smaller models (8B and 70B) where additional tokens compensate for limited model capacity.
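The arithmetic in this paragraph, spelled out for all three model sizes. This uses the 20-tokens-per-parameter rule of thumb quoted above and round parameter counts; the function names are illustrative.

```python
TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb quoted above
TRAINING_TOKENS = 15e12    # ~15T tokens

def chinchilla_optimal_tokens(n_params):
    return TOKENS_PER_PARAM * n_params

def overtraining_ratio(n_params, tokens=TRAINING_TOKENS):
    return tokens / chinchilla_optimal_tokens(n_params)

for name, n_params in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    optimal = chinchilla_optimal_tokens(n_params)
    print(f"{name}: optimal ~{optimal / 1e12:.2f}T tokens, "
          f"trained at {overtraining_ratio(n_params):.1f}x that")
```

The 405B model is only modestly past the rule of thumb, while the 8B model is trained at nearly a hundred times its Chinchilla-optimal token count.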

Data quality pipeline

Raw CommonCrawl is noisy — full of spam, duplicated content, toxic text, and low-quality pages. Llama 3's data pipeline applies multiple cleaning stages:

Stage 1: URL filtering

Remove known spam domains, adult content, phishing sites. Maintain domain allow/deny lists.

↓

Stage 2: Text extraction

Parse HTML, remove boilerplate (nav bars, footers, ads). Keep only the main content.

↓

Stage 3: Language ID

FastText classifier identifies language. Keep English and 30+ other languages at calibrated ratios.

↓

Stage 4: Quality classifier

A BERT-based classifier scores each document for quality (trained on Wikipedia + reference documents vs random web). Low-quality documents are removed or downsampled.

↓

Stage 5: Deduplication

MinHash-based near-duplicate removal at document level. Exact n-gram dedup at line level. Removes ~50% of remaining data.

↓

Stage 6: Safety filtering

Remove PII (names, emails, phone numbers), toxic content, and CSAM. Multiple classifiers + heuristics.
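Stage 5 can be illustrated with a small MinHash sketch. This is a generic textbook version, not Meta's pipeline: the shingle size, the number of hash functions, and the use of seeded SHA-1 are all assumptions chosen for readability.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text, num_hashes=64):
    """Signature: for each seeded hash, the minimum value over all shingles."""
    grams = shingles(text)
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "large language models are trained on trillions of tokens from many sources"

sig_a, sig_b, sig_c = minhash(doc_a), minhash(doc_b), minhash(doc_c)
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

A production pipeline would typically bucket signatures with locality-sensitive hashing so that only likely duplicates are compared, rather than comparing every pair.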

Data Pipeline Funnel

Watch how raw web crawl data is progressively filtered into high-quality training tokens. Each stage removes a portion of the data. The funnel shows approximate data volumes at each stage.

Data mixing and annealing

Not all data is created equal. Meta uses data annealing — changing the data mixture during training:

Training Phase	Data Mix	Rationale
Main training (first ~14T)	85% web, 8% code, 3% books, ...	Broad knowledge acquisition
Annealing (final ~1T)	Upweight high-quality sources	Polish quality on the best data
Long-context extension	Extend to 128K with long documents	Teach long-range dependencies

The annealing phase is particularly clever: in the final stages of training, the learning rate is reduced to near-zero and the data mixture is shifted to emphasize the highest-quality documents. This is analogous to "polishing" a lens — the final refinement that sharpens quality without the risk of large updates destabilizing the model.

Tokenization: 128K BPE vocabulary

Llama 3 quadrupled the vocabulary from 32K (Llama 2) to 128K tokens. Each piece of text is therefore encoded in fewer tokens, which brings several benefits:

Benefit	Mechanism	Impact
Faster inference	Fewer tokens to generate	~15% fewer tokens for English text
More information per step	Each token carries more meaning	Better for code, math, multilingual
Better multilingual	Non-English scripts get more dedicated vocabulary entries	Significantly fewer tokens for CJK, Arabic, etc.

Why did Meta train Llama 3 on 15T tokens — nearly 2x the Chinchilla-optimal amount for 405B parameters?

Because they found that more tokens continue to improve quality even past the Chinchilla optimum, and the extra tokens especially help smaller models (8B, 70B) where additional data compensates for limited model capacity.

Because they had more compute available than needed.

Because the Chinchilla scaling laws only apply to smaller models.

Chapter 3: Training Infrastructure

Training a 405B parameter model on 15T tokens is one of the largest compute operations ever undertaken. Let's understand the engineering required.

Hardware

Llama 3 405B was trained on 16,384 NVIDIA H100 GPUs, organized into clusters connected by high-bandwidth networking:

Component	Specification
GPUs	16,384 × NVIDIA H100 80GB
GPU memory	~1.3 PB total (16K × 80 GB)
Interconnect (intra-node)	NVLink 4.0, 900 GB/s per GPU
Interconnect (inter-node)	400 Gbps InfiniBand per GPU
Total compute	~3.8 × 10²⁵ FLOPs
Training time	~54 days for 405B

The scale in perspective: 16,384 H100 GPUs running for 54 days consume approximately 30 GWh of electricity, roughly the annual electricity consumption of 3,000 US homes. The hardware alone (not including networking, cooling, or storage) costs approximately $500M at list price. This is why open-weight releases matter: very few organizations can afford to train a model of this scale.

Parallelism strategy

A 405B model with fp16 weights requires ~810 GB just for the parameters. A single H100 has 80 GB of memory. You can't fit even the weights on one GPU, let alone the optimizer states and activations. The solution: split the model across thousands of GPUs using multiple parallelism strategies simultaneously.
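The memory gap can be quantified with a short calculation. The 16-bytes-per-parameter figure is an assumption based on the usual mixed-precision Adam accounting (16-bit weights and gradients plus fp32 master weights, momentum, and variance); activations are ignored entirely.

```python
PARAMS = 405e9
GPU_MEMORY_GB = 80

def gigabytes(bytes_per_param):
    return PARAMS * bytes_per_param / 1e9

weights_gb = gigabytes(2)  # 16-bit weights only
# Assumed training state: 2 (weights) + 2 (grads) + 4 + 4 + 4 (fp32 master, m, v)
training_state_gb = gigabytes(16)
min_gpus_for_weights = int(-(-weights_gb // GPU_MEMORY_GB))  # ceiling division

print(f"Weights: {weights_gb:.0f} GB")
print(f"Weights + grads + optimizer state: {training_state_gb:.0f} GB")
print(f"GPUs needed just to hold the weights: {min_gpus_for_weights}")
```

Even before activations, the training state under this accounting is several terabytes, which is why the state has to be sharded across many devices rather than merely split across a handful.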

Tensor Parallelism (TP=8)

Split each layer's weight matrices across 8 GPUs within a node. Each GPU computes a slice of the attention/FFN and they communicate via NVLink (fast, 900 GB/s).

Pipeline Parallelism (PP=16)

Split the 126 layers across 16 pipeline stages (~8 layers each). Micro-batch pipelining keeps all stages busy. Communicates between nodes via InfiniBand.

Data Parallelism (DP=128)

128 replicas of the full pipeline, each processing different data. Gradients are averaged across replicas via all-reduce.

Total GPUs = TP × PP × DP = 8 × 16 × 128 = 16,384
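One way to picture the layout is as a mapping from a global GPU rank to a coordinate in the three parallelism dimensions. The ordering below (tensor-parallel index varying fastest, so each TP group stays inside one 8-GPU node) is an assumed convention for illustration; `rank_to_coords` is a hypothetical helper.

```python
TP, PP, DP = 8, 16, 128
N_LAYERS = 126

def rank_to_coords(rank):
    """Map a global GPU rank to (dp, pp, tp) with TP varying fastest."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

world_size = TP * PP * DP
layers_per_stage = N_LAYERS / PP

print(f"World size: {world_size}")
print(f"Average layers per pipeline stage: {layers_per_stage:.2f}")
print(f"Rank 8 -> {rank_to_coords(8)}")  # first GPU of the second pipeline stage
```

Because 126 is not divisible by 16, the stages cannot all hold the same number of layers; the average is just under 8.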

3D Parallelism Visualizer

See how 16,384 GPUs are organized. Tensor parallelism splits within a node (8 GPUs). Pipeline parallelism chains nodes into stages. Data parallelism replicates the full pipeline.

Training stability

At this scale, hardware failures are the norm, not the exception. With 16K GPUs running for 54 days:

Challenge	Frequency	Mitigation
GPU failure	~Daily	Automated detection + restart from checkpoint
Network issue	Several per day	Redundant paths, graceful reconnection
Loss spike	~3 during 405B training	Roll back to earlier checkpoint, skip problematic data
NaN in gradients	Rare but catastrophic	Gradient clipping, loss scaling, checkpoint recovery

Meta reports 466 job interruptions during a 54-day window of the 405B training run. Of these, 419 were unexpected, and about 78% of the unexpected interruptions were attributed to confirmed or suspected hardware issues. Their automated recovery system was able to restart from the latest checkpoint within minutes, keeping effective training time above 90%.

Why does training the 405B model require three different parallelism strategies (tensor, pipeline, and data parallelism) simultaneously?

Because the 405B model is too large to fit on a single GPU (810 GB weights vs 80 GB per GPU): tensor parallelism splits layers across GPUs within a node, pipeline parallelism distributes layers across nodes, and data parallelism provides throughput by processing different data on replicated pipelines. Together they utilize all 16,384 GPUs efficiently.

Because each strategy handles a different type of data.

Because it makes the model more accurate.

Chapter 4: Scaling Laws & Training Recipe

How do you decide whether to train an 8B, 70B, or 405B model? How many tokens should each see? How do you set the learning rate and batch size? Meta used scaling laws — empirical relationships between model size, data size, and compute — to answer these questions before committing to the full training runs.

The scaling law methodology

Meta trained hundreds of small models (40M to 16B parameters) on varying amounts of data and measured their validation loss. They then fit power-law curves to predict the performance of larger models:

L(N, D) = A / N^α + B / D^β + L_∞

Where N is the number of parameters, D is the number of training tokens, A, B, α, β are fitted constants, and L_∞ is the irreducible loss (the best possible loss with infinite data and model size). This equation has three terms:

Term	Meaning
A / N^α	Loss reduction from model size — bigger models memorize more patterns
B / D^β	Loss reduction from data — more data prevents overfitting
L_∞	Irreducible loss — the entropy of natural language itself

A critical finding: Meta's scaling experiments showed that the Chinchilla-optimal compute allocation (roughly equal investment in model size and data) is not optimal when inference cost matters. Since you only train once but serve the model millions of times, it's better to "overtrain" a smaller model on more data — getting a model that's cheaper to serve at similar quality. This is why the 8B model was trained on 15T tokens (1875 tokens per parameter) rather than the Chinchilla-optimal ~160B tokens (20 tokens per parameter).
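The trade-off can be explored with the loss formula above. Meta's fitted constants are not given here, so this sketch borrows the constants published for Chinchilla (Hoffmann et al. 2022) purely as a stand-in, and uses the common approximation of 6·N·D training FLOPs; the numbers it prints are illustrative, not Llama 3 results.

```python
def loss(N, D, A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """L(N, D) = A / N^alpha + B / D^beta + L_inf (Chinchilla constants assumed)."""
    return A / N**alpha + B / D**beta + L_inf

def train_flops(N, D):
    """Common approximation: about 6 FLOPs per parameter per token."""
    return 6 * N * D

small = dict(N=8e9, D=15e12)     # overtrained small model
large = dict(N=80e9, D=1.6e12)   # near-Chinchilla model at similar compute

for name, cfg in [("8B @ 15T", small), ("80B @ 1.6T", large)]:
    print(f"{name}: loss ~{loss(**cfg):.3f}, "
          f"train FLOPs ~{train_flops(**cfg):.2e}")
```

Under this borrowed fit the two configurations cost about the same to train and land within a few hundredths of each other in predicted loss, while the smaller model needs roughly a tenth of the compute per generated token at serving time.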

Training recipe

Phase 1: Initial pre-training

Standard training on ~14T tokens with 8K context. Batch size ramps from small to 16M tokens. Learning rate warms up then follows cosine decay.

↓

Phase 2: Long-context extension

Gradually extend context from 8K → 128K tokens over ~800B tokens. RoPE base frequency increased from 500K → 8M to support longer sequences.

↓

Phase 3: Annealing

Final ~1T tokens at near-zero learning rate. Upweight highest-quality data. This "polishing" phase yields 1-2% improvement on benchmarks.

Scaling Law Predictions

See how validation loss decreases with model size and training tokens. The curves show power-law scaling: each doubling of model size or data gives diminishing but consistent returns.

Learning rate schedule

Phase	Learning Rate	Schedule
Warmup	0 → 8×10^-5	Linear, 8000 steps
Main	8×10^-5 → 8×10^-7	Cosine decay over the main ~14T tokens
Annealing	8×10^-7 → 0	Linear decay over final ~1T tokens
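The warmup and cosine phases of the table translate into a short schedule function. The peak rate, floor, and warmup length come from the table; `total_steps` is a hypothetical value chosen for illustration, and the final linear anneal to zero is left out.

```python
import math

def learning_rate(step, peak=8e-5, floor=8e-7, warmup_steps=8_000,
                  total_steps=1_200_000):
    """Linear warmup to `peak`, then cosine decay down to `floor`."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return floor + (peak - floor) * cosine

for step in (0, 4_000, 8_000, 600_000, 1_200_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```

The cosine shape spends a long time near the peak rate and then flattens out as it approaches the floor, which leaves the separate annealing phase to take the rate the rest of the way down.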

Why did Meta "overtrain" the 8B model on 15T tokens (far more than the Chinchilla-optimal ~160B tokens)?

Because when inference cost matters (you train once but serve millions of times), it's better to train a smaller model on more data, getting comparable quality at much cheaper serving cost. The 8B model at 15T tokens is Pareto-better for deployment than a hypothetical 80B model at 1.6T tokens at the same compute budget.

Because they ran out of larger GPUs to train a bigger model.

Because 15T is the standard amount for all language models.

Chapter 5: Post-Training

Pre-training produces a powerful but "raw" model — it can complete text fluently but doesn't know how to follow instructions, refuse harmful requests, or engage in helpful dialogue. Post-training transforms this base model into an aligned assistant through multiple stages of fine-tuning.

The post-training pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Train on ~27K high-quality human-written (prompt, response) pairs. Teaches the model the format of helpful, harmless responses.

↓

Stage 2: Reward Model

Train a separate model to score responses by quality. Uses ~1M human preference comparisons (response A vs response B for same prompt).

↓

Stage 3: DPO / Rejection Sampling

Optimize the SFT model using the reward model's signal. Multiple rounds, each using the latest model to generate new samples.

↻ Repeat stages 1-3 iteratively (6 rounds total)

Stage 1: SFT — Learning to follow instructions

SFT uses carefully curated prompt-response pairs. Meta emphasizes quality over quantity: they found that 27,540 high-quality examples outperformed 10x more lower-quality examples. The SFT data covers:

Category	Examples	Purpose
General helpfulness	~10K	Answer questions, follow instructions, explain concepts
Safety	~5K	Refuse harmful requests, provide safety context
Code	~5K	Write, debug, explain code
Math/reasoning	~4K	Step-by-step problem solving
Multilingual	~3K	Follow instructions in non-English languages

Stage 2: Reward model

The reward model is initialized from the pre-trained base model and trained to predict which of two responses a human annotator would prefer. Given a prompt x and two responses y_w (preferred) and y_l (rejected):

L_RM = -log σ(r(x, y_w) - r(x, y_l))

Where r(x, y) is the scalar reward score and σ is the sigmoid function. This is the Bradley-Terry model of pairwise preferences, the same model that underlies chess Elo ratings.
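A minimal numeric sketch of this loss for a single preference pair, taking the two scalar rewards as plain floats (the function name is illustrative):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), written as log(1 + exp(-margin))."""
    margin = r_chosen - r_rejected
    if margin < -30:  # exp(-margin) dominates; avoid overflow for huge gaps
        return -margin
    return math.log1p(math.exp(-margin))

print(reward_model_loss(0.0, 0.0))   # no preference expressed
print(reward_model_loss(2.0, 0.0))   # chosen response scored higher
print(reward_model_loss(0.0, 2.0))   # ranking is wrong, larger loss
```

The loss depends only on the difference between the two scores, so the reward model is free to pick any absolute scale.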

Stage 3: DPO — Direct Preference Optimization

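Rather than optimizing the policy against the reward model with PPO (as in classic RLHF), Llama 3 primarily uses DPO (Rafailov et al., 2023). The Stage 2 reward model is still trained, but it ranks candidates for rejection sampling rather than serving as a reinforcement learning objective. DPO optimizes the policy directly on preference data: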

L_DPO = -log σ(β [log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)])

DPO avoids the instability of PPO training and is simpler to implement. The key idea: instead of fitting a reward model and then optimizing against it, DPO uses the language model itself as an implicit reward model.
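The DPO loss for a single preference pair can be computed from four sequence log-probabilities. This is a sketch with plain floats; the default `beta=0.1` is an assumed, commonly used value, and the function name is illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    chosen_logratio = logp_w - ref_logp_w    # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = logp_l - ref_logp_l  # log pi_theta(y_l|x) / pi_ref(y_l|x)
    z = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-z))          # equals -log sigmoid(z)

# Policy identical to the reference: no implicit preference yet
start = dpo_loss(-42.0, -40.0, -42.0, -40.0)
# Policy has raised the chosen response and lowered the rejected one
improved = dpo_loss(-38.0, -45.0, -42.0, -40.0)
print(start, improved)
```

The quantity inside the sigmoid plays the role of a reward margin, which is the sense in which the language model acts as its own implicit reward model.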

The iterative recipe is the secret weapon. Meta runs 6 rounds of the full post-training pipeline. In each round, the latest model generates new responses, humans rate them, and the model is further refined. Each round improves on the previous by generating higher-quality training data from a better starting model. This is qualitatively different from running DPO once — it's a feedback loop where the model bootstraps its own improvement.

Post-Training Pipeline

Watch the iterative post-training process. Each round consists of SFT, reward model training, and DPO. The model quality improves with each round as better models generate better training data.

Rejection sampling

In addition to DPO, Meta uses rejection sampling: generate K responses from the current model, score each with the reward model, and keep only the best one for SFT. This is simpler than DPO and provides high-quality training examples for the next round of SFT.

python
# Rejection sampling: generate K responses, keep the best
def rejection_sample(model, reward_model, prompt, K=16):
    responses = [model.generate(prompt, temperature=0.7) for _ in range(K)]
    scores = [reward_model.score(prompt, r) for r in responses]
    best_idx = max(range(K), key=lambda i: scores[i])
    return responses[best_idx]  # highest-scoring response

# This is compute-intensive but produces very high quality
# training examples for the next SFT round

Why does Meta run 6 rounds of the post-training pipeline instead of just one round of SFT + DPO?

Because each round generates higher-quality training data from the improved model, creating a bootstrapping loop where better models produce better examples, which produce even better models. This iterative refinement is qualitatively different from a single pass.

Because each round trains on different tasks.

Because 6 is the optimal number according to scaling laws.

Chapter 6: Multimodal Extensions

The Llama 3 paper describes experiments extending the language model to handle images, video, and speech — turning it into a multimodal model. These extensions follow a common pattern: keep the pre-trained language model mostly frozen and train adapter modules that project other modalities into the language model's embedding space.

Image understanding

The vision extension uses a pre-trained image encoder (ViT-H) connected to the language model via a learned projection layer:

Image Encoder (ViT-H/14)

Pre-trained, frozen. Extracts patch embeddings from the image. Output: [N_patches, D_vision]

↓

Projection Layer

Cross-attention adapter that maps vision features into the language model's embedding space. Trained during multimodal fine-tuning.

↓

Llama 3 Language Model

Processes interleaved text and projected image tokens. Generates text responses.

The cross-attention adapter is key: instead of simply projecting image features through an MLP (as in LLaVA), Llama 3 uses cross-attention layers inserted between Transformer blocks. The image tokens serve as keys and values, while the text tokens serve as queries. This gives the language model fine-grained access to spatial information in the image.

Video understanding

Video is handled by sampling frames uniformly, encoding each with the image encoder, and concatenating the resulting tokens. For a 30-second video at 1 fps, this produces 30 × N_patches tokens — which quickly consumes the context window. Meta addresses this with temporal aggregation: adjacent frames' features are pooled to reduce the token count.

Speech understanding

Speech is encoded using a separate audio encoder (trained from scratch) that converts waveforms into discrete token sequences. These are interleaved with text tokens, allowing the model to understand and respond to spoken input.

Multimodal Architecture

See how different modalities connect to the Llama 3 language model through adapters. The language model backbone remains shared across all modalities.

The modular approach: By keeping the language model frozen and only training adapters, Meta avoids catastrophic forgetting — the model retains its full language capabilities while gaining multimodal understanding. This is also more compute-efficient than training a multimodal model from scratch.

Multimodal results

Benchmark	Llama 3 405B	GPT-4V	Gemini 1.5 Pro
MMMU (val)	64.5	63.1	62.2
VQAv2	79.0	77.2	—
ChartQA	83.2	78.5	81.3

How does Llama 3 add image understanding without forgetting its language capabilities?

By keeping the pre-trained language model frozen and only training cross-attention adapter layers that project image features (from a frozen ViT encoder) into the language model's embedding space. This preserves language capabilities while adding vision understanding.

By training the entire model from scratch on image-text pairs.

By converting images into text descriptions before processing.

Chapter 7: Safety

A 405B open-weight model can be used for anything — including harmful purposes. Meta addresses this through multiple layers of safety measures, while acknowledging that open models present fundamentally different safety challenges than API-only models.

Llama Guard

Llama Guard is a separate safety classifier (based on Llama 3 8B) that classifies inputs and outputs into safe/unsafe categories. It acts as a guardrail that can be deployed alongside the main model:

User input

The raw prompt from the user

↓

Llama Guard (input)

Classifies: is this prompt safe to respond to? Categories: violence, sexual content, CBRN, self-harm, ...

↓

Llama 3 (main model)

Generates response (only if input is classified as safe)

↓

Llama Guard (output)

Classifies: is this response safe to show? Catches harmful outputs even from safe prompts.
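The four-step flow above amounts to a thin wrapper around the main model. In the sketch below the model and both classifiers are passed in as plain callables; `toy_model` and `toy_classifier` are hypothetical stand-ins (a keyword check, not Llama Guard) so that the example runs on its own.

```python
def guarded_generate(prompt, generate, classify_input, classify_output,
                     refusal="I can't help with that."):
    """Check the prompt, generate only if it is safe, then check the response."""
    if classify_input(prompt) != "safe":
        return refusal
    response = generate(prompt)
    if classify_output(prompt, response) != "safe":
        return refusal
    return response

# Hypothetical stand-ins for the real models
def toy_classifier(*texts):
    return "unsafe" if any("weapon" in t.lower() for t in texts) else "safe"

def toy_model(prompt):
    return f"Echo: {prompt}"

print(guarded_generate("Hello there", toy_model, toy_classifier, toy_classifier))
print(guarded_generate("How do I build a weapon?", toy_model,
                       toy_classifier, toy_classifier))
```

Keeping the classifiers as separate arguments mirrors the deployment choice described above: the guard sits outside the main model, so it can be swapped, tuned, or omitted independently.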

Llama Guard classifies across a taxonomy of safety categories, including:

Category	Examples
S1: Violent crimes	Instructions for weapons, terrorism, assault
S2: Non-violent crimes	Fraud, hacking, drug trafficking
S3: Sex-related crimes	CSAM, trafficking, non-consensual content
S4: Child safety	Content exploiting or endangering minors
S5: Defamation	False statements damaging reputation
S6: Regulated advice	Unauthorized legal, medical, financial advice
S7: Privacy	PII exposure, stalking, doxxing
S8: IP violations	Copyright infringement, trademark abuse

The open model safety dilemma: With a closed model (API), the provider can enforce safety at the server. With an open model, anyone can remove the safety training (fine-tune the model on harmful data) or bypass Llama Guard (just don't use it). Meta's position is that the benefits of open models (transparency, research access, democratization) outweigh the incremental risk, since the underlying capabilities (chemistry, biology, etc.) are already freely available in textbooks and on the internet.

Safety training in post-training

Beyond Llama Guard (which is an external classifier), safety is also baked into the model itself during post-training:

Method	Mechanism
Safety SFT data	~5K examples of appropriate refusals and safety-aware responses
Red teaming	Adversarial attacks by human experts to find failure modes before release
Safety DPO	Preference data specifically targeting safety: harmful responses are labeled as "rejected"
System prompt	Default instructions that guide the model toward safe behavior

Safety Pipeline Visualizer

See how a user query flows through the safety system. Both input and output are checked by Llama Guard.

How does Llama Guard differ from the safety training built into the model during post-training?

Llama Guard is a separate classifier model that checks inputs and outputs externally (and can be bypassed by removing it), while safety training (SFT data, safety DPO, red teaming) is baked into the model's weights during post-training. They provide complementary defense layers.

Llama Guard is faster but less accurate than post-training safety.

They are the same thing: Llama Guard IS the post-training safety.

Chapter 8: Model Explorer

Let's put it all together. This interactive explorer lets you compare all three Llama 3 model sizes across benchmarks, explore the architecture at different scales, and see how the training pipeline transforms a base model into an aligned assistant.

Llama 3 Benchmark Dashboard

Compare Llama 3 models (8B, 70B, 405B) against each other and against GPT-4 across major benchmarks.

Benchmark results

Benchmark	8B	70B	405B	GPT-4
MMLU (5-shot)	68.4	82.0	87.3	86.4
GSM8K (8-shot CoT)	79.6	95.1	96.8	92.0
MATH (4-shot)	30.0	50.4	73.8	52.9
HumanEval	72.6	80.5	89.0	86.6
ARC-Challenge	83.4	93.0	96.9	96.4
GPQA	32.8	46.7	51.1	41.4

Where Llama 3 405B beats GPT-4 in the table above: MMLU (87.3 vs 86.4), GSM8K (96.8 vs 92.0), MATH (73.8 vs 52.9), HumanEval (89.0 vs 86.6), ARC-Challenge (96.9 vs 96.4), and GPQA (51.1 vs 41.4). The math results are particularly notable: the 20.9-point gap on MATH is the largest in the table. Where it falls behind: instruction following, creative writing, and complex multi-turn dialogue.

Parameter efficiency across scales

How efficiently does each model use its parameters? We can measure this as benchmark score per billion parameters:

Model	MMLU	MMLU/Billion Params	Tokens Seen	FLOPs
8B	68.4	8.55	15T	~7×10²³ (≈ 6 × params × tokens)
70B	82.0	1.17	15T	~6×10²⁴ (≈ 6 × params × tokens)
405B	87.3	0.22	15T	~3.8×10²⁵

The 8B model is by far the most parameter-efficient — each additional billion parameters yields diminishing returns. This is why the 8B model is the most popular for deployment: it offers 80% of the 405B's quality at 2% of the inference cost.
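The ratios in the table and in the paragraph above follow directly from the MMLU scores and parameter counts. A small check, using parameter count as a rough proxy for inference cost (an assumption, since real serving cost also depends on hardware and batching):

```python
models = {"8B": (8, 68.4), "70B": (70, 82.0), "405B": (405, 87.3)}

def mmlu_per_billion(name):
    params_b, mmlu = models[name]
    return mmlu / params_b

for name in models:
    print(f"{name}: {mmlu_per_billion(name):.2f} MMLU points per billion parameters")

relative_quality = models["8B"][1] / models["405B"][1]
relative_size = models["8B"][0] / models["405B"][0]
print(f"8B vs 405B: {relative_quality:.0%} of the MMLU score "
      f"at {relative_size:.1%} of the parameters")
```

By this measure the 8B model reaches a little under four fifths of the 405B MMLU score with about 2% of the parameters.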

On which benchmark category does Llama 3 405B most significantly outperform GPT-4, and by how much?

Mathematical reasoning. MATH benchmark: 73.8 vs 52.9 (a 20.9 point gap), and GPQA: 51.1 vs 41.4. The combination of more training data, larger model, and SFT on math-specific data gives Llama 3 a significant advantage on quantitative tasks.

Natural language understanding. MMLU: 87.3 vs 86.4.

Code generation. HumanEval: 89.0 vs 86.6.

Chapter 9: Connections

Llama 3 represents the culmination of years of research in scaling, training recipes, and alignment. Let's place it in the broader context.

What Llama 3 builds on

Foundation	Contribution
Transformer (2017)	The attention-based architecture; Llama 3 uses its decoder-only variant
Chinchilla (2022)	Scaling laws that guided compute allocation decisions
GQA (Ainslie 2023)	Grouped query attention for efficient KV caching
SwiGLU (Shazeer 2020)	Gated FFN activation that consistently improves quality
RoPE (Su 2021)	Rotary position embeddings for length generalization
DPO (Rafailov 2023)	Direct preference optimization used in post-training
RLHF (Ouyang 2022)	The reward model + optimization paradigm for alignment

The open model ecosystem

Model Family	Key Feature	Relation to Llama 3
Mistral / Mixtral	Mixture of experts (MoE)	Alternative scaling strategy: more parameters, fewer FLOPs per token
Qwen 2	Strong multilingual	Competitive open model from Alibaba, similar architecture
Gemma 2	Google open weights	Smaller models (2B-27B), similar philosophy
Phi-3	Small model excellence	Microsoft's bet on data quality over model size
DeepSeek-V2	MoE + MLA attention	Innovative attention mechanism alternative to GQA

Key innovations and their impact

Llama 3's lasting contributions:
1. Scaling data matters as much as scaling models. 15T tokens on 8B params produces better results than 1.5T on 70B for many tasks.
2. Iterative post-training is crucial. 6 rounds of SFT→RM→DPO, each bootstrapping from the improved model.
3. Standard architectures work. No exotic modifications needed — GQA, SwiGLU, and RoPE are all established techniques combined carefully.
4. Open models can match closed ones. 405B approaches GPT-4 on most benchmarks, proving that the "secret sauce" is primarily scale and data quality, not proprietary architectural innovations.

The future trajectory

The Llama series shows a clear trajectory: each generation gets more data (1.4T → 2T → 15T), larger models (65B → 70B → 405B), and more sophisticated post-training (no RLHF → RLHF → iterative DPO). If this trend continues, Llama 4 will likely feature even more training data (50T+?), longer context (1M+?), native multimodality, and potentially mixture-of-experts architectures for efficiency.

"Our experience suggests that it is possible to train models at the frontier of AI capabilities using a standard, dense Transformer architecture. Neither mixture-of-experts models, nor a novel architecture, are required."
— Llama 3 Technical Report, Meta AI

What is the most important lesson from the Llama 1→2→3 progression for the field of AI?

That bigger models always win.

That you need proprietary architectures for best results.

That frontier AI capabilities come from scaling data (1.4T→15T), careful post-training (iterative DPO), and engineering execution on standard architectures, not from proprietary innovations. Open models can match closed ones at sufficient scale.

The Llama 3 Herd of Models