Dettmers, Lewis, Belkada, Zettlemoyer (U. Washington / Meta AI / Hugging Face) — NeurIPS 2022

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

The paper that made 175B-parameter models fit on consumer GPUs — by discovering that outlier features break quantization and inventing mixed-precision decomposition to fix it.

Prerequisites: Matrix multiplication + How transformers work + What floating point numbers are. That's it.
10 chapters, 5+ simulations, 0 assumed knowledge

Chapter 0: The Memory Wall

You've trained a 175-billion-parameter language model. It took months of compute on thousands of GPUs. Now you want to use it — run inference, answer questions, generate text. One problem: the model's weights alone consume 350 GB of memory in FP16 (16-bit floating point). A single NVIDIA A100 GPU has 80 GB. You need at least 5 GPUs just to hold the model, before you've processed a single token.

This is the memory wall. For large language models at and beyond 6.7B parameters, the feed-forward and attention projection layers — the matrix multiplications — are responsible for 95% of all parameters and 65-85% of all computation. The bottleneck is not compute speed; it's how much model you can fit in memory.

What if you could represent each weight with 8 bits instead of 16? You'd cut memory in half. A 175B model would shrink from 350 GB to 175 GB — suddenly fitting on a single server with consumer GPUs. The arithmetic would be faster too: INT8 tensor cores can do roughly twice the throughput of FP16.

The catch: Prior to this paper, every attempt to quantize transformers larger than 350M parameters to 8-bit caused noticeable performance degradation. Methods that worked beautifully on BERT-sized models (350M params) simply broke when applied to models with billions of parameters. Nobody understood why.

Let's see the scale of the problem. Here's what each hardware setup can run in 16-bit vs. 8-bit precision:

| Hardware | GPU Memory | Largest Model (16-bit) | Largest Model (8-bit) |
|---|---|---|---|
| 8x A100 80GB | 640 GB total | OPT-175B / BLOOM | OPT-175B / BLOOM |
| 8x A100 40GB | 320 GB total | OPT-66B | OPT-175B / BLOOM |
| 8x RTX 3090 | 192 GB total | OPT-66B | OPT-175B / BLOOM |
| 4x RTX 3090 | 96 GB total | OPT-30B | OPT-66B |
| Colab Pro | 15 GB | GPT-J-6B | OPT-13B |
| Free Colab | 12 GB | GPT-2 1.3B | T0/T5-11B |

Look at that last row. With 8-bit quantization, a free Google Colab notebook, something any student can access, can run an 11B-parameter model. In 16-bit, you're limited to 1.3B. That's roughly an 8x increase in accessible model size.

[Interactive figure: Memory Footprint by Precision. Bars show the GPU memory required to hold model weights at each precision for a selected model size (6.7B shown).]

Where the memory goes

Let's do the arithmetic. A 175B-parameter model stores each parameter as a 16-bit float = 2 bytes. Total: $175 \times 10^9 \times 2$ bytes = 350 GB. But that's just the weights. During inference, you also need memory for:

| Component | What it stores | Approximate size (175B) |
|---|---|---|
| Model weights | $W_Q$, $W_K$, $W_V$, $W_O$, $W_{up}$, $W_{down}$ for each layer | 350 GB (FP16) / 175 GB (INT8) |
| KV cache | Key and value tensors for all past tokens | Grows with sequence length |
| Activations | Intermediate hidden states during forward pass | ~1-5 GB (depends on batch size) |
| CUDA overhead | Framework buffers, memory fragmentation | ~2-5 GB |

The weights dominate. If you can halve the weight memory, you can fit the model on half the GPUs — or fit a model twice as large on the same hardware.
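The weight arithmetic can be checked with a few lines of Python. This is a minimal sketch that counts only the weights (no KV cache, activations, or framework overhead); the helper names `weight_memory_gb` and `gpus_needed` are made up for this illustration.

```python
import math

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def gpus_needed(memory_gb: float, gpu_memory_gb: float) -> int:
    """Minimum GPU count to hold the weights, ignoring KV cache and overhead."""
    return math.ceil(memory_gb / gpu_memory_gb)

fp16_gb = weight_memory_gb(175e9, 16)   # 175B parameters at 2 bytes each
int8_gb = weight_memory_gb(175e9, 8)    # the same model at 1 byte each

print(f"FP16: {fp16_gb:.0f} GB -> {gpus_needed(fp16_gb, 80)} x 80 GB GPUs")
print(f"INT8: {int8_gb:.0f} GB -> {gpus_needed(int8_gb, 80)} x 80 GB GPUs")
```

In practice the real GPU count is higher than this lower bound, because the other rows of the table above also need room.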

Dettmers et al. solved this problem. Their method, LLM.int8(), quantizes transformer weights to 8-bit integers with zero performance degradation — even at 175B parameters. The key insight? They discovered that quantization fails at scale because of a handful of emergent outlier features — values 20x larger than everything else — that appear in specific hidden dimensions once models exceed about 6.7B parameters. By isolating these outliers and handling them in 16-bit while quantizing everything else to 8-bit, they achieved lossless compression.

This wasn't just an engineering improvement. It was a scientific discovery about how transformers organize information at scale — and that discovery is what made the engineering possible.

Let's understand how, starting with the basics of quantization.

Why is the memory wall the primary bottleneck for LLM inference, rather than compute speed?

Chapter 1: Quantization 101

Before we can understand LLM.int8(), we need to understand what quantization means and why it's hard. The core idea is simple: represent numbers with fewer bits.

A 16-bit floating point number (FP16) can represent values from approximately -65,504 to +65,504 with varying precision. It uses 1 sign bit, 5 exponent bits, and 10 mantissa bits. This gives you fine-grained precision near zero and coarser precision for large values — exactly what you want for neural network weights, which tend to cluster near zero.

An 8-bit signed integer (INT8) can represent exactly 256 values: the integers from -128 to +127. That's it. No fractions, no exponents. Just 256 evenly spaced integers.

To feel the difference: FP16 can distinguish between 1.0000 and 1.0010. INT8 can only distinguish between 1 and 2. If a weight has value 1.3, INT8 must round it to either 1 or 2 — losing the fractional part. This rounding error is quantization error, and the entire challenge of model quantization is keeping it small enough that the model doesn't notice.
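To make the precision gap concrete, the sketch below rounds a value to FP16 and to INT8 using only the standard library. It relies on the `struct` module's half-precision format code `"e"`; the helper names `to_fp16` and `to_int8` are hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_int8(x: float) -> int:
    """Round to the nearest integer and clamp to the signed 8-bit range."""
    return max(-128, min(127, round(x)))

x = 1.3
print("FP16:", to_fp16(x), "error:", abs(to_fp16(x) - x))
print("INT8:", to_int8(x), "error:", abs(to_int8(x) - x))
```

Without any scaling, the INT8 error on 1.3 is a few hundred times larger than the FP16 error, which is why the choice of scaling factor matters so much in what follows.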

The quantization problem in one sentence: How do you map a continuous range of floating-point values (potentially thousands of distinct values in a weight tensor) down to just 256 discrete integers, while losing as little information as possible?

Think of it like compressing a high-resolution photograph into a 256-color palette. You need a mapping function that assigns each original color to its closest palette entry, and a reverse mapping (dequantization) that converts back. The quality of the compressed image depends entirely on how cleverly you choose those 256 colors.

For neural networks, there are two main approaches:

Symmetric (absmax): Map the range [-max, +max] to [-127, +127]. Zero maps to zero. Simple and fast, but wastes half the range if values are all positive.

Asymmetric (zeropoint): Map the range [min, max] to [-127, +127]. Uses the full INT8 range. More precise for skewed distributions, but requires extra bookkeeping.

Let's build these from scratch.

Why does quantization matter for matrix multiplication?

A transformer's core operation is matrix multiplication: hidden states X (shape [sequence_length, hidden_dim]) multiplied by weight matrices W (shape [hidden_dim, output_dim]). In FP16, each element of X and W is a 16-bit float. The multiplication produces a 32-bit output that's then cast back to FP16.

If we quantize both X and W to INT8, we halve the memory for storing weights and can use INT8 tensor cores, which are roughly 2x faster than FP16 cores on modern GPUs. The INT8 multiplication produces INT32 outputs (because two 8-bit integers multiplied can be up to 127 × 127 = 16,129, needing more than 8 bits), which we dequantize back to FP16. The key question: does this round-trip — quantize, multiply, dequantize — preserve the original FP16 result?

FP16 inputs: $X_{f16}$ [s, h] and $W_{f16}$ [h, o], original precision
↓ quantize
INT8 inputs: $X_{i8}$ [s, h] and $W_{i8}$ [h, o], each value rounded to [-127, 127]
↓ multiply (INT8 × INT8 = INT32)
INT32 output: $C_{i32}$ [s, o], accumulated dot products
↓ dequantize (divide by scaling factors)
FP16 output: $C_{f16}$ [s, o] ≈ $X_{f16} \cdot W_{f16}$

$$X_{f16} W_{f16} = C_{f16} \approx S_{f16} \cdot X_{i8} W_{i8}$$

Where $S_{f16}$ is a scaling factor that converts the INT32 result back to FP16. The "approximately equals" is the crux: how close is the approximation? That depends on the quantization scheme. A good scheme minimizes the error between $C_{f16}$ (the true FP16 matmul) and the dequantized INT8 result. A bad scheme introduces enough error that the model's outputs become garbage.
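The round trip can be traced on a tiny example in plain Python. This is a sketch, assuming the simplest scheme (one absmax scale per matrix, derived in the next chapter) and made-up values for X and W; the integer accumulator plays the role of the INT32 output.

```python
def absmax_scale(matrix):
    """Scaling factor 127 / max|x| over the whole matrix (tensor-wise)."""
    return 127.0 / max(abs(v) for row in matrix for v in row)

def quantize(matrix, scale):
    """Scale, round to the nearest integer, clamp to [-127, 127]."""
    return [[max(-127, min(127, round(v * scale))) for v in row] for row in matrix]

def matmul(A, B):
    """Plain matrix product; integer inputs give exact integer sums."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[1.0, -0.5, 0.2], [0.3, 2.0, -0.1]]        # [s=2, h=3], illustrative values
W = [[0.4, -0.3], [0.1, 0.6], [-0.7, 0.2]]      # [h=3, o=2], illustrative values

s_x, s_w = absmax_scale(X), absmax_scale(W)
C_int = matmul(quantize(X, s_x), quantize(W, s_w))          # integer accumulator
C_deq = [[c / (s_x * s_w) for c in row] for row in C_int]   # S = 1 / (s_x * s_w)
C_ref = matmul(X, W)                                        # full-precision reference

max_err = max(abs(a - b) for ra, rb in zip(C_deq, C_ref) for a, b in zip(ra, rb))
print("reference  :", C_ref)
print("dequantized:", C_deq)
print("max error  :", max_err)
```

On well-behaved inputs like these the dequantized result stays close to the reference; the following chapters show what changes when the inputs contain outliers.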

Why does absmax (symmetric) quantization waste precision for asymmetric distributions like ReLU outputs (which are always non-negative)?

Chapter 2: Absmax Quantization — Deriving It

Let's derive absmax quantization from first principles. We have a floating-point tensor $X_{f16}$ with shape [s, h] (sequence length by hidden dimension). We want to map every value into the integer range [-127, +127].

The strategy: find the largest absolute value in the tensor, and scale everything so that value maps to 127 (or -127). All other values are scaled proportionally and rounded to the nearest integer.

Step 1: Compute the scaling factor

Find the infinity norm of the tensor — the maximum absolute value across all elements:

$$s_x = \frac{127}{\|X_{f16}\|_\infty} = \frac{127}{\max_{ij} |X_{f16,ij}|}$$

This is the scaling factor. If the largest value in X is 5.0, then $s_x$ = 127 / 5.0 = 25.4. Every value gets multiplied by 25.4, so 5.0 maps to 127, 2.5 maps to ~64, and 0.1 maps to ~3.

Step 2: Quantize

$$X_{i8} = \lfloor s_x \cdot X_{f16} \rceil$$

Where ⌊·⌉ denotes rounding to the nearest integer. That's the entire quantization formula. Multiply by the scaling factor, round to integer, clamp to [-127, 127].

Step 3: Dequantize

To recover the approximate original value, divide by the scaling factor:

$$X_{f16} \approx \frac{X_{i8}}{s_x}$$

Worked example

Suppose we have a small tensor $X_{f16}$ = [-0.8, 1.5, 0.3, -2.1, 0.7].

Step by step:
1. max absolute value = |−2.1| = 2.1
2. $s_x$ = 127 / 2.1 = 60.48
3. Quantize each value:
  −0.8 × 60.48 = −48.38 → −48
  1.5 × 60.48 = 90.71 → 91
  0.3 × 60.48 = 18.14 → 18
  −2.1 × 60.48 = −127.0 → −127
  0.7 × 60.48 = 42.33 → 42
4. Dequantize: divide by 60.48
  −48 / 60.48 = −0.794 (original: −0.8, error: 0.006)
  91 / 60.48 = 1.505 (original: 1.5, error: 0.005)
  18 / 60.48 = 0.298 (original: 0.3, error: 0.002)
  −127 / 60.48 = −2.100 (original: −2.1, error: 0.000)
  42 / 60.48 = 0.694 (original: 0.7, error: 0.006)

Not bad! The worst error is 0.006. But now imagine what happens if one value in the tensor is an outlier with magnitude 60:

The outlier catastrophe: If X = [-0.8, 1.5, 0.3, -60.0, 0.7], then $s_x$ = 127/60 = 2.117. Now:
  −0.8 × 2.117 = −1.69 → −2 (dequantized: −0.945, error: 0.145!)
  1.5 × 2.117 = 3.18 → 3 (dequantized: 1.417, error: 0.083)
  0.3 × 2.117 = 0.64 → 1 (dequantized: 0.472, error: 0.172!)
A single outlier crushes the precision of every other value. Most of the 254 quantization levels are wasted on the empty range between ±2 and ±60.

This is foreshadowing. Remember this: a single large outlier destroys quantization precision for an entire tensor. We'll see that this is exactly what happens in large transformers.
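The two worked examples can be reproduced side by side. The sketch below uses plain Python lists rather than tensors, and the helper names are made up for this illustration; it compares the average error on the four normal values with and without the outlier.

```python
def absmax_roundtrip(values):
    """Quantize with one absmax scale, then dequantize; returns (codes, recovered)."""
    scale = 127.0 / max(abs(v) for v in values)
    codes = [max(-127, min(127, round(v * scale))) for v in values]
    recovered = [c / scale for c in codes]
    return codes, recovered

def mean_abs_error(values, recovered, skip_index=None):
    """Average absolute error, optionally ignoring one position (the outlier)."""
    pairs = [(v, r) for i, (v, r) in enumerate(zip(values, recovered)) if i != skip_index]
    return sum(abs(v - r) for v, r in pairs) / len(pairs)

clean = [-0.8, 1.5, 0.3, -2.1, 0.7]
dirty = [-0.8, 1.5, 0.3, -60.0, 0.7]   # same tensor, index 3 replaced by an outlier

clean_codes, clean_rec = absmax_roundtrip(clean)
dirty_codes, dirty_rec = absmax_roundtrip(dirty)

# Compare errors on the four values that are identical in both tensors.
clean_err = mean_abs_error(clean, clean_rec, skip_index=3)
dirty_err = mean_abs_error(dirty, dirty_rec, skip_index=3)
print(clean_codes, clean_err)
print(dirty_codes, dirty_err)
```

The normal values are unchanged between the two tensors, yet their integer codes collapse from a spread of roughly ±90 to a handful of values near zero once the outlier sets the scale.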

[Interactive figure: Absmax Quantization Precision. A random tensor of 32 values with a controllable outlier magnitude, showing how one outlier destroys quantization precision for everything else.]

python
import torch

def absmax_quantize(X_f16):
    """Absmax quantization: FP16 -> INT8"""
    scale = 127.0 / X_f16.abs().max()     # s_x = 127 / ||X||_inf
    X_i8 = (X_f16 * scale).round().clamp(-127, 127).to(torch.int8)
    return X_i8, scale

def absmax_dequantize(X_i8, scale):
    """Dequantize: INT8 -> FP16"""
    return X_i8.float() / scale

# Example: quantize a weight matrix
W = torch.randn(4096, 4096, dtype=torch.float16)
W_i8, scale = absmax_quantize(W)
W_recovered = absmax_dequantize(W_i8, scale)
print(f"Max error: {(W.float() - W_recovered).abs().max():.6f}")
# The max error is about half a quantization step, i.e. roughly max|W| / 254.

If a tensor has values in the range [-1.0, 1.0] except for one outlier at 50.0, what fraction of the 254 usable INT8 levels (from -127 to +127) are dedicated to representing the "normal" values in [-1, 1]?

Chapter 3: Vector-wise Quantization

In the previous chapter, we used a single scaling factor for the entire tensor. One outlier anywhere ruins precision everywhere. The natural fix: use more scaling factors.

The key insight from Dettmers et al. is to view matrix multiplication as a sequence of independent inner products. Consider multiplying $X \in \mathbb{R}^{s \times h}$ by $W \in \mathbb{R}^{h \times o}$. Each element of the output C[i, j] is the inner product of row i of X with column j of W. These inner products are independent: they don't share any computation.

Key insight: If each inner product is independent, we can use a different scaling factor for each row of X and each column of W. An outlier in row 5 of X only affects the scaling for row 5, not row 6. This is vector-wise quantization.

Three levels of quantization granularity

| Method | Scaling constants | # Constants | Outlier isolation |
|---|---|---|---|
| Tensor-wise | One per tensor | 1 for X, 1 for W | None: one outlier kills everything |
| Row-wise | One per row of X, one for all of W | s for X, 1 for W | Partial: isolates outliers to their row |
| Vector-wise | One per row of X, one per column of W | s for X, o for W | Best: each inner product has its own scale |

The math

For vector-wise quantization, we assign a scaling constant $c_{x_i}$ to each row i of X (computed as 127 / max|X[i, :]|) and a constant $c_{w_j}$ to each column j of W (computed as 127 / max|W[:, j]|). The quantized matrix multiplication becomes:

$$C_{f16} \approx \frac{1}{c_x \otimes c_w} \cdot C_{i32} = S \cdot (X_{i8} W_{i8})$$

Where $c_x \in \mathbb{R}^s$ is the vector of row-wise scaling constants, $c_w \in \mathbb{R}^o$ is the vector of column-wise constants, and ⊗ denotes the outer product. The dequantization matrix $S = 1 / (c_x \otimes c_w)$, with the reciprocal taken elementwise, has shape [s, o] and is applied elementwise to $C_{i32}$: one dequantization factor per output element.

Worked example

Suppose X is 2×3 and W is 3×2:

$X_{f16}$ = [[1.0, -0.5, 0.2], [0.3, 2.0, -0.1]]

Row scaling constants:
cx[0] = 127 / max(1.0, 0.5, 0.2) = 127 / 1.0 = 127.0
cx[1] = 127 / max(0.3, 2.0, 0.1) = 127 / 2.0 = 63.5

Quantize each row independently:
Row 0: [1.0 × 127, -0.5 × 127, 0.2 × 127] = [127, -64, 25]
Row 1: [0.3 × 63.5, 2.0 × 63.5, -0.1 × 63.5] = [19, 127, -6]

The key difference from tensor-wise: row 0 keeps its own scale factor. With a single tensor-wise scale of 127/max(all) = 127/2.0 = 63.5, row 0 would quantize to [64, -32, 13] and use only half of the available range. With its own scale of 127, the value 1.0 maps to 127 and row 0 uses the full range.

Similarly, W gets one scaling constant per column.
python
import torch

def vector_wise_quantize(X, W):
    """Vector-wise quantization for matrix multiplication."""
    # Row-wise scaling for X: one constant per row
    cx = 127.0 / X.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)  # shape [s, 1]
    X_i8 = (X * cx).round().clamp(-127, 127).to(torch.int8)

    # Column-wise scaling for W: one constant per column
    cw = 127.0 / W.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)  # shape [1, o]
    W_i8 = (W * cw).round().clamp(-127, 127).to(torch.int8)

    # INT8 matmul (accumulate in INT32)
    C_i32 = X_i8.int() @ W_i8.int()  # shape [s, o]

    # Dequantize: outer product of inverse scaling constants
    S = 1.0 / (cx @ cw)  # shape [s, o] — one factor per output
    C_f16 = (C_i32.float() * S).half()

    return C_f16

The paper shows that vector-wise quantization preserves perplexity up to about 2.7B parameters. Beyond that, even vector-wise quantization starts to degrade — because of the emergent outlier features we'll see in Chapter 4.

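| Method | 125M | 1.3B | 2.7B | 6.7B | 13B |
|---|---|---|---|---|---|
| FP32 baseline | 25.65 | 15.91 | 14.43 | 13.30 | 12.45 |
| INT8 absmax tensor-wise | 87.76 | 16.55 | 15.11 | 14.59 | 19.08 |
| INT8 absmax vector-wise | 35.84 | 16.82 | 14.98 | 14.13 | 16.48 |
| INT8 zeropoint vector-wise | 25.72 | 15.94 | 14.36 | 13.38 | 13.47 |
| LLM.int8() | 25.83 | 15.93 | 14.44 | 13.24 | 12.45 |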

Study this table carefully. Look at the 13B column. Tensor-wise absmax: 19.08 — worse than the 6.7B model. Vector-wise absmax: 16.48 — still terrible. Even zeropoint: 13.47, slightly degraded. Only LLM.int8() achieves 12.45, matching the FP32 baseline exactly. Something catastrophic happens at the 6.7B+ scale that only LLM.int8() can handle.
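To put the 13B column in relative terms, the short sketch below converts each perplexity into a percentage increase over the FP32 baseline. The numbers are copied from the table; `relative_increase` is a hypothetical helper.

```python
baseline_13b = 12.45  # FP32 perplexity at 13B, copied from the table
int8_13b = {
    "absmax tensor-wise": 19.08,
    "absmax vector-wise": 16.48,
    "zeropoint vector-wise": 13.47,
    "LLM.int8()": 12.45,
}

def relative_increase(ppl: float, baseline: float) -> float:
    """Perplexity increase over the baseline, in percent."""
    return 100.0 * (ppl - baseline) / baseline

for method, ppl in int8_13b.items():
    print(f"{method:>22}: {relative_increase(ppl, baseline_13b):5.1f}%")
```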

In vector-wise quantization, how many scaling constants are computed for the hidden state matrix X of shape [s, h]?

Chapter 4: Emergent Outlier Features

This is the paper's most important contribution: the discovery and characterization of emergent outlier features in large transformer hidden states. This section is fascinating because it reveals something genuinely surprising about how large language models work internally.

Dettmers et al. examined transformer hidden states $X \in \mathbb{R}^{s \times h}$ across models from 125M to 13B parameters. They tracked individual feature dimensions $h_i$ (columns of the hidden state matrix) across all layers, looking for values with magnitude ≥ 6.0.
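The magnitude test itself is easy to sketch. The code below is an illustration on a toy hidden state, not the paper's analysis pipeline: it applies only the ≥ 6.0 check to one matrix and reports, per flagged column, the fraction of token rows affected. The function name and the toy values are assumptions.

```python
def find_outlier_dims(hidden, threshold=6.0):
    """Return {column index: fraction of rows with |value| >= threshold}.

    `hidden` is a list of rows, one row per token (shape [s, h]).
    Only columns with at least one value at or above the threshold are reported.
    """
    num_rows = len(hidden)
    report = {}
    for col in range(len(hidden[0])):
        hits = sum(1 for row in hidden if abs(row[col]) >= threshold)
        if hits:
            report[col] = hits / num_rows
    return report

# Toy hidden state: 4 tokens, 6 feature dimensions; column 3 is the outlier dimension.
hidden = [
    [0.5, -1.2, 0.8, -44.0, 0.3, -0.7],
    [1.1, 0.4, -0.9, -39.5, 0.2, 1.3],
    [-0.3, 2.2, 0.1, -1.0, -0.6, 0.9],
    [0.7, -0.8, 1.5, -41.2, 0.0, -1.1],
]
print(find_outlier_dims(hidden))
```

Running the same check over every layer of a model would give the per-layer and per-position percentages of the kind reported in this chapter.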

What they found

At small scales (GPT-2 from 117M to 1.5B parameters), outliers are rare and sporadic: 1-2 dimensions with large values, appearing in 25-41% of layers and affecting 6-35% of sequence positions. Quantization handles these fine because they're not systematic enough to dominate the scaling factor.

At medium scales (2.7B-6B), outliers become more common: 5-6 dimensions, appearing in 52-62% of layers. Quantization starts to struggle.

The phase transition at 6.7B parameters: Something dramatic happens. At exactly 6.7B parameters (measured in the fairseq model family), outliers suddenly appear in 100% of all transformer layers and 75% of all sequence positions. The outlier magnitudes jump from ~20 to ~40. And these outliers are concentrated in just 6 feature dimensions out of thousands.

For a 6.7B model with sequence length 2048, this means approximately 150,000 outlier values per sequence — but all concentrated in just 6 out of ~4096 hidden dimensions.
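
| Model | Params | PPL | # Outlier dims | % Layers | % Seq dims | Outlier magnitude (quartiles) |
|---|---|---|---|---|---|---|
| GPT-2 | 117M | 33.5 | 1 | 25% | 6% | (-8, -7, -6) |
| GPT-2 | 345M | 26.0 | 2 | 29% | 18% | (6, 7, 8) |
| GPT-2 | 1.5B | 21.0 | 2 | 41% | 35% | (-11, -9, -7) |
| FSEQ | 2.7B | 14.4 | 5 | 52% | 18% | (-25, -16, -9) |
| GPT-J | 6.0B | 13.8 | 6 | 62% | 28% | (-21, -17, -14) |
| FSEQ | 6.7B | 13.3 | 6 | 100% | 75% | (-44, -40, -35) |
| FSEQ | 13B | 12.5 | 7 | 100% | 73% | (-63, -58, -45) |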

Look at the magnitude column. At 6.7B, the outlier values reach -44. At 13B, they reach -63. Normal feature values in a transformer hidden state are typically in the range [-3.5, +3.5]. These outliers are 10-20x larger than everything else.

Why do outliers matter for the model?

Dettmers et al. tested what happens when you remove outlier features by setting them to zero before they enter the attention layers. The results are dramatic:

Removing just 6-7 outlier dimensions (out of ~4096):
• Mean top-1 softmax probability drops from ~40% to ~20%
• Validation perplexity increases by 600-1000%

Removing 7 random (non-outlier) dimensions instead:
• Top-1 probability decreases by only 0.02-0.3%
• Perplexity increases by only ~0.1%

These features make up only about 0.1% of all input dimensions, but they carry an outsized fraction of the model's "decision-making" information. They dominate the attention softmax, steering which tokens attend to which. The model has learned to encode critical information in just a handful of dimensions with extreme magnitudes.

Why does this happen?

The paper observes that emergence correlates more closely with perplexity than raw model size — a better-trained smaller model could potentially trigger the phase shift. The outliers appear to be a strategy the model develops to create sharp attention patterns. When a dimension has magnitude 60 and everything else is < 3, the softmax becomes extremely peaked on the tokens with the large outlier values. This is the mechanism the model uses to "pay attention" to specific positions.
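The sharpening effect of one very large score on a softmax can be seen in isolation. This sketch uses made-up score vectors and says nothing about how real attention logits are formed; it only shows that a single outlier-sized entry pushes nearly all probability mass onto one position.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

moderate = [2.0, 1.0, 0.5, 3.0]    # all scores in a "normal" range
extreme = [2.0, 1.0, 0.5, 60.0]    # one score inflated to outlier size

print("moderate:", [round(p, 3) for p in softmax(moderate)])
print("extreme :", [round(p, 3) for p in softmax(extreme)])
```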

The outliers are also highly asymmetric — mostly one-sided (either all positive or all negative across the sequence dimension). This is important: it explains why zeropoint (asymmetric) quantization outperforms absmax (symmetric) quantization for models at the 6.7B+ scale.

[Interactive figure: Outlier Emergence Phase Transition. Shows how outlier features progressively take over transformer layers as model scale increases; the 6.7B setting marks the phase transition.]

What makes the outlier emergence at 6.7B parameters a "phase transition" rather than a gradual change?

Chapter 5: Why Outliers Break INT8

Now we understand the two pieces of the puzzle: quantization schemes (Chapters 2-3) and emergent outliers (Chapter 4). Let's put them together and see exactly why standard quantization fails at scale.

The geometry of the problem

Remember that vector-wise quantization assigns one scaling constant per row of the hidden state X. A row of X is one token's hidden state — all h feature dimensions for a single sequence position. Outliers occur in specific feature columns — the same handful of hidden dimensions across all tokens.

The dimensional mismatch: Vector-wise quantization normalizes by row. Outliers live in columns. A row that contains an outlier of magnitude 60 alongside normal values of magnitude 2 will have its scaling factor dominated by 60, crushing the precision of all the normal values in that row. And since outliers occur in 75% of all sequence positions after the phase transition, 75% of all rows are contaminated.

Let's trace through a concrete example. Consider a hidden state with h=6 dimensions, where dimension 3 is an outlier dimension:

Example row from X at 6.7B scale:
X[token_42, :] = [0.5, -1.2, 0.8, -44.0, 0.3, -0.7]

Row-wise absmax:
s = 127 / 44.0 = 2.886
Quantized: [1, -3, 2, -127, 1, -2]

Dequantized:
[0.346, -1.039, 0.693, -44.0, 0.346, -0.693]

Errors:
dim 0: |0.5 - 0.346| = 0.154 (31% relative error!)
dim 1: |1.2 - 1.039| = 0.161 (13% error)
dim 3: |44.0 - 44.0| = 0.000 (perfect: it's at the boundary)

The outlier is quantized perfectly, but the normal values lose massive precision. Four of the five normal dimensions end up with 13-31% relative error (only dim 5 escapes, at about 1%), so the downstream computation is severely corrupted.

This is why the perplexity table shows degradation at 6.7B+. Every row that contains an outlier (75% of rows) has its normal-valued features crushed. These normal values — 99.9% of the tensor — carry the bulk of the semantic information. Destroying their precision destroys the model's ability to distinguish between similar tokens.

Why zeropoint helps but isn't enough

Recall that outlier features are almost always one-sided (all negative or all positive). Absmax quantization maps [-max, +max] to [-127, +127], wasting half the range when values are one-sided. Zeropoint quantization uses the asymmetric range [min, max] to [-127, +127], utilizing all 254 levels.

This explains why zeropoint outperforms absmax in the perplexity table at 6.7B (13.38 vs 14.13 for vector-wise). But by 13B, even zeropoint fails (13.47 vs 12.45 baseline) — the outlier magnitudes have grown so large (-63) that no amount of clever scaling within a single row can preserve precision for the normal values alongside them.
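The advantage of zeropoint quantization on one-sided data can be checked numerically. The sketch below assumes a common affine formulation (shift by the minimum, scale the span onto 254 steps), which may differ in detail from the paper's exact zeropoint definition; the sample values are invented.

```python
def absmax_roundtrip(values):
    """Symmetric: map [-max|x|, +max|x|] onto [-127, 127], then dequantize."""
    scale = 127.0 / max(abs(v) for v in values)
    return [round(v * scale) / scale for v in values]

def zeropoint_roundtrip(values):
    """Asymmetric: map [min, max] onto [-127, 127] with an offset, then dequantize."""
    lo, hi = min(values), max(values)
    scale = 254.0 / (hi - lo)
    codes = [round((v - lo) * scale) - 127 for v in values]   # codes lie in [-127, 127]
    return [(c + 127) / scale + lo for c in codes]

def max_abs_error(values, recovered):
    """Largest absolute round-trip error."""
    return max(abs(v - r) for v, r in zip(values, recovered))

one_sided = [0.0, 0.13, 0.42, 0.77, 1.08, 1.51, 1.94, 2.5]   # all non-negative
err_absmax = max_abs_error(one_sided, absmax_roundtrip(one_sided))
err_zeropoint = max_abs_error(one_sided, zeropoint_roundtrip(one_sided))
print("absmax   :", err_absmax)
print("zeropoint:", err_zeropoint)
```

Because absmax spends half of its codes on negative values that never occur here, its step size is twice as large as the zeropoint step. An outlier of magnitude 60 in the same row would still stretch the span for both schemes.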

The fundamental impossibility

The problem is mathematically inescapable as long as we try to represent both outliers and normal values with the same 8-bit integers in the same row. We have 254 quantization levels. The range of values in a contaminated row is [-63, +3.5] or so. That's a span of 66.5. Each quantization step covers 66.5/254 = 0.26. Normal values that differ by 0.1 or less will usually land on the same integer. Information is permanently lost.

A concrete calculation: Consider a row with values [0.15, -0.23, 0.08, -58.0, -0.31, 0.19].

With absmax: scale = 127/58 = 2.19
• 0.15 × 2.19 = 0.33 → rounds to 0
• -0.23 × 2.19 = -0.50 → rounds to -1
• 0.08 × 2.19 = 0.18 → rounds to 0

Values 0.15 and 0.08 are identical after quantization — both are 0. The distinction between them is erased. Across an entire hidden state with 4096 dimensions and 75% of rows contaminated, this information loss is catastrophic.
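The collapse can be counted directly. This sketch quantizes the row from the calculation above twice, once with the outlier present and once with it removed, and counts how many distinct integers the five normal values occupy. The helper name is made up, and the 6.0 cutoff is used only to pick out the outlier.

```python
def absmax_codes(values):
    """INT8 codes under a single absmax scale for the whole row."""
    scale = 127.0 / max(abs(v) for v in values)
    return [max(-127, min(127, round(v * scale))) for v in values]

row = [0.15, -0.23, 0.08, -58.0, -0.31, 0.19]
normal = [v for v in row if abs(v) < 6.0]          # the row without its outlier

codes_with_outlier = absmax_codes(row)
codes_without = absmax_codes(normal)

# Codes of the five normal values when the outlier sets the scale.
normal_codes_with = [c for v, c in zip(row, codes_with_outlier) if abs(v) < 6.0]
print(len(set(normal_codes_with)), "distinct codes:", normal_codes_with)
print(len(set(codes_without)), "distinct codes:", codes_without)
```

With the outlier in the row, five different values share two integer codes; without it, each value gets its own code.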

Why column-wise quantization isn't the answer

You might wonder: why not quantize X by column instead of by row? Then each column gets its own scaling constant, and the outlier columns get their own large scale. The problem is that a scaling constant can only be undone after the INT8 matmul if it is constant along the dimension being summed. Each output C[i, j] sums over the hidden dimension h, so a constant that depends only on row i of X, or only on column j of W, factors out of the sum and can be divided out afterward. A per-column constant for X changes with h inside the sum and cannot be factored out. Vector-wise quantization already uses the one combination that works (rows of X, columns of W), but the outliers live in the columns of X, the axis that cannot get its own scaling constant.
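The factorization argument can be written out for a single output element. This is a short derivation in the chapter's notation; the per-column constants $a_h$ for X are hypothetical, introduced only to show what goes wrong.

```latex
% Row constants for X and column constants for W factor out of the sum over h:
\sum_{h} \left(c_{x_i} X_{ih}\right)\left(c_{w_j} W_{hj}\right)
  = c_{x_i}\, c_{w_j} \sum_{h} X_{ih} W_{hj}
  = c_{x_i}\, c_{w_j}\, C_{ij}

% A per-column constant a_h for X stays inside the sum:
\sum_{h} \left(a_h X_{ih}\right) W_{hj}
  \;\neq\; a \sum_{h} X_{ih} W_{hj}
  \quad \text{in general, for any single constant } a
```

Undoing $a_h$ would require folding $1/a_h$ into row h of W before quantizing W, so the two matrices could no longer be scaled independently.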

The solution? Don't try. Separate the outliers from the normal values and handle them differently. This is exactly what mixed-precision decomposition does.

Why can't vector-wise (row-wise for X) quantization handle outlier features, even though it uses a separate scaling constant per row?

Chapter 6: Mixed-Precision Decomposition

The insight is elegant in its simplicity: if outlier features live in a few columns and destroy quantization for everything they touch, just remove them before quantizing. Handle the outlier columns in full 16-bit precision. Quantize everything else to INT8. Then combine the results.

The algorithm step by step

Given input hidden states $X_{f16} \in \mathbb{R}^{s \times h}$ and weights $W_{f16} \in \mathbb{R}^{h \times o}$:

1. Detect outlier columns
Scan $X_{f16}$ for columns (feature dimensions $h_i$) where any value has magnitude ≥ α = 6.0. Collect these indices into the outlier set $O$.
2. Decompose X and W
Extract the outlier columns from X → $X_{outlier} \in \mathbb{R}^{s \times |O|}$. Extract the corresponding rows from W → $W_{outlier} \in \mathbb{R}^{|O| \times o}$. The remaining columns/rows form $X_{regular}$ and $W_{regular}$.
3. Multiply outliers in FP16
$C_{outlier} = X_{outlier} \cdot W_{outlier}$ (standard FP16 matmul, no quantization)
4. Quantize and multiply regular values in INT8
Apply vector-wise absmax quantization to $X_{regular}$ and $W_{regular}$. Perform INT8 matmul. Dequantize the result.
5. Combine
$C_{f16} = C_{outlier} + C_{regular}$ (add in FP16)

The formula

In Einstein notation where indices are superscripts and h indexes the hidden dimension:

$$C_{f16} \approx \sum_{h \in O} X_{f16}^{h} W_{f16}^{h} + S_{f16} \cdot \sum_{h \notin O} X_{i8}^{h} W_{i8}^{h}$$

The first sum handles outlier dimensions in FP16. The second sum handles everything else in INT8 with vector-wise quantization ($S_{f16}$ is the dequantization scaling matrix from the outer product of row/column constants).

Why the memory cost is negligible

The critical fact: for transformers up to 13B parameters, |O| ≤ 7. Out of a hidden dimension of ~5120 at the 13B scale, only 7 dimensions are outliers. That's 7/5120 = 0.14% of the dimensions handled in FP16. The remaining 99.86% are in INT8. The extra memory for storing the FP16 outlier columns is negligible — about 0.1% additional memory on top of the INT8 baseline.

python
import torch

def llm_int8_matmul(X_f16, W_f16, threshold=6.0):
    """LLM.int8() mixed-precision matrix multiplication."""
    # Step 1: Find outlier columns in X
    # A column is an outlier if any value in it has magnitude >= threshold
    outlier_mask = X_f16.abs().max(dim=0).values >= threshold  # [h]
    outlier_cols = outlier_mask.nonzero().squeeze(-1)         # indices
    regular_cols = (~outlier_mask).nonzero().squeeze(-1)      # indices

    # Step 2: Decompose
    X_outlier = X_f16[:, outlier_cols]          # [s, |O|]
    W_outlier = W_f16[outlier_cols, :]          # [|O|, o]
    X_regular = X_f16[:, regular_cols]          # [s, h-|O|]
    W_regular = W_f16[regular_cols, :]          # [h-|O|, o]

    # Step 3: Outliers in FP16
    C_outlier = X_outlier @ W_outlier           # [s, o] in FP16

    # Step 4: Regular values — vector-wise INT8 quantization
    # Row-wise scaling for X_regular
    cx = 127.0 / X_regular.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    X_i8 = (X_regular * cx).round().clamp(-127, 127).to(torch.int8)

    # Column-wise scaling for W_regular
    cw = 127.0 / W_regular.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)
    W_i8 = (W_regular * cw).round().clamp(-127, 127).to(torch.int8)

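    # INT8 matmul + dequantize. PyTorch has no general integer matmul for CUDA
    # tensors, so this sketch emulates the INT32 accumulation in FP32; a real
    # implementation would call an INT8 tensor-core kernel here.
    C_acc = X_i8.float() @ W_i8.float()
    S = 1.0 / (cx.float() @ cw.float())         # outer product dequant, [s, o]
    C_regular = (C_acc * S).half()              # [s, o] back to FP16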

    # Step 5: Combine
    return C_outlier + C_regular

# Usage: drop-in replacement for any linear layer
X = torch.randn(2048, 4096, dtype=torch.float16, device='cuda')
W = torch.randn(4096, 4096, dtype=torch.float16, device='cuda')
output = llm_int8_matmul(X, W)

The α = 6.0 threshold: The paper finds that treating any feature with magnitude ≥ 6.0 as an outlier is sufficient to eliminate perplexity degradation. This threshold was determined empirically: below 6.0, too many non-critical features are separated (unnecessary overhead); above 6.0, some critical outliers get quantized to INT8 (precision loss). The value 6.0 sits at the sweet spot, roughly 2x the typical feature range of [-3.5, 3.5].

In the mixed-precision decomposition, what determines whether a feature dimension is treated as an "outlier" column?

Chapter 7: The Full LLM.int8() Algorithm

Let's put it all together. LLM.int8() is the combination of two techniques: vector-wise absmax quantization and mixed-precision decomposition. Together, they form a complete procedure that converts a 16-bit transformer checkpoint to 8-bit at inference time with zero performance degradation.

The complete data flow

For every linear layer in the transformer (feed-forward up-projection, down-projection, attention Q/K/V/O projections), the following happens at inference time:

Input: $X_{f16}$ [s, h] and $W_{f16}$ [h, o]
Hidden states from the previous layer (FP16) and the weight matrix (originally stored in FP16, can be pre-converted to INT8 for weights).
Outlier detection
Find $O = \{i : \max_t |X[t, i]| \ge 6.0\}$. Typically |O| ≤ 7.
Decompose
Split X into $X_O$ [s, |O|] and $X_R$ [s, h−|O|]. Split W into $W_O$ [|O|, o] and $W_R$ [h−|O|, o].
↓ parallel
FP16 path: $C_O = X_O W_O$
0.1% of dimensions. Full precision matmul. No quantization error.
+
INT8 path: quantize $X_R$ (row-wise), $W_R$ (col-wise), matmul, dequantize
99.9% of dimensions. Vector-wise absmax quantization. Minimal error since outliers are removed.
Output: $C_{f16} = C_O + \mathrm{dequant}(C_R)$
Accumulate in FP16. Pass to next layer.

What layers get quantized?

LLM.int8() targets the matrix multiplications in:

| Layer | Operation | Quantized? |
|---|---|---|
| FFN up-projection | $X \cdot W_{up}$ | Yes (INT8 + mixed-precision) |
| FFN down-projection | $X \cdot W_{down}$ | Yes (INT8 + mixed-precision) |
| Attention Q/K/V projections | $X \cdot W_Q$, $X \cdot W_K$, $X \cdot W_V$ | Yes (INT8 + mixed-precision) |
| Attention output projection | $X \cdot W_O$ | Yes (INT8 + mixed-precision) |
| Attention scores ($Q \cdot K^T$) | $\mathrm{Softmax}(QK^T/\sqrt{d})$ | No: no parameters, pure activation |
| Embeddings | Lookup table | No: not a matmul |
| LayerNorm | Normalization | No: tiny parameter count |

This covers 95% of all model parameters and 65-85% of all computation. The remaining operations (embeddings, normalization, attention scores) stay in FP16.

Memory savings

Since 99.9% of values are stored in INT8 (1 byte) instead of FP16 (2 bytes), the memory savings are approximately 2x for the quantized layers. For BLOOM-176B, the total memory reduction is 1.96x — from ~352 GB to ~180 GB. The slight shortfall from a perfect 2x comes from the FP16 outlier columns, the scaling constants, and the non-quantized layers (embeddings, norms).

Runtime performance

The quantization and decomposition overhead matters for runtime speed. For models up to 6.7B, LLM.int8() is actually slower than FP16 due to the overhead of outlier detection and two separate matmuls. For larger models (13B and up), INT8 tensor cores provide enough speedup to overcome the overhead:

| Model size | FP16 baseline | Vector-wise INT8 (no decomp) | LLM.int8() |
|---|---|---|---|
| 2.7B | 1.00x | 0.94x | 0.64x (slower) |
| 6.7B | 1.00x | 1.18x | 0.86x (slower) |
| 13B | 1.00x | 1.59x | 1.22x (faster) |
| 175B | 1.00x | 2.00x | 1.81x (faster) |

For BLOOM-176B end-to-end inference, LLM.int8() on 3x A100 80GB is comparable to FP16 on 8x A100 80GB — same speed, but using fewer than half the GPUs.

Why doesn't LLM.int8() quantize the attention score computation (softmax(QK^T / sqrt(d)))?

Chapter 8: Quantization Explorer

Now let's see the full LLM.int8() algorithm in action. This interactive simulation lets you experience the mixed-precision decomposition on a real(istic) hidden state matrix. You can control the outlier magnitude, the number of outlier dimensions, and see how the quantization error changes with and without mixed-precision decomposition.

[Interactive figure: LLM.int8() Mixed-Precision Decomposition. A simulated hidden state matrix X [8 tokens, 16 dims] with outlier columns highlighted, comparing standard INT8 quantization against LLM.int8() decomposition. Controls: outlier magnitude (shown at 40), outlier columns (shown at 2), and a mixed-precision toggle (shown ON, outliers in FP16).]

Notice how in standard INT8 mode, increasing the outlier magnitude causes the error for normal values to spike dramatically. Switch to LLM.int8() mode and the normal-value error drops to near-zero because outliers are handled separately in FP16.

What happens inside the simulation

The simulation does exactly what the paper's algorithm does:

Standard INT8: All values are quantized with a single row-wise scaling factor. The outlier dominates the scale, crushing precision for normal values. Try setting the outlier magnitude to 70 and observe the red error bars — almost every token row has substantial error.

LLM.int8(): Outlier columns are extracted and multiplied in FP16, so they incur no quantization error. The remaining values are quantized with their own row-wise scaling factor, which is no longer distorted by outliers. The two results are summed. Total error is orders of magnitude lower. The dashed red lines show where the standard INT8 error would be, which makes the size of the improvement easy to see.
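
The two modes can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the simulation's actual source or the bitsandbytes kernels: the function names, the random data, the weight shape, and the outlier positions are all made up for the example, and the magnitude threshold of 6.0 follows the value used in the paper.

```python
import numpy as np

def quantize_rowwise(A):
    """Absmax INT8 quantization with one scaling factor per row."""
    absmax = np.abs(A).max(axis=1, keepdims=True, initial=0.0)
    scale = np.where(absmax > 0, absmax / 127.0, 1.0)   # all-zero rows get scale 1
    q = np.round(A / scale).astype(np.int8)             # values land in [-127, 127]
    return q, scale

def int8_matmul(X, W):
    """Standard mode: quantize rows of X and columns of W, multiply in integers."""
    Xq, sx = quantize_rowwise(X)        # one scale per token row of X
    Wq, sw = quantize_rowwise(W.T)      # one scale per output column of W
    acc = Xq.astype(np.int32) @ Wq.T.astype(np.int32)   # integer accumulation
    return acc * sx * sw.T              # dequantize with the outer product of scales

def llm_int8_matmul(X, W, threshold=6.0):
    """Mixed-precision mode: outlier columns in FP16, everything else in INT8."""
    outlier = (np.abs(X) >= threshold).any(axis=0)      # boolean mask over columns of X
    fp16_part = X[:, outlier].astype(np.float16) @ W[outlier, :].astype(np.float16)
    int8_part = int8_matmul(X[:, ~outlier], W[~outlier, :])
    return fp16_part.astype(np.float64) + int8_part, outlier

# Hypothetical data: 8 tokens x 16 dims, two outlier columns of magnitude 40.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
X[:, [3, 11]] = 40.0 * rng.choice([-1.0, 1.0], size=(8, 2))
W = rng.normal(size=(16, 32))

ref = X @ W                                             # full-precision reference
err_standard = np.abs(int8_matmul(X, W) - ref).mean()
Y, outlier_cols = llm_int8_matmul(X, W)
err_mixed = np.abs(Y - ref).mean()

print("outlier columns:", np.flatnonzero(outlier_cols))
print(f"mean abs error, standard INT8: {err_standard:.4f}")
print(f"mean abs error, LLM.int8():    {err_mixed:.4f}")
```

In this sketch the FP16 path still carries ordinary half-precision rounding error, so the mixed-precision result is not exact. What disappears is the error caused by outliers stretching the INT8 scale.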

Experiments to try

1. Increase outlier magnitude from 5 to 70. In standard mode, watch the error grow linearly. In LLM.int8() mode, the error for normal values stays flat — because the outlier magnitude doesn't affect the INT8 quantization path at all.

2. Set outlier columns to 0. Both modes become identical because there is nothing to decompose. This is why LLM.int8() offers little accuracy benefit for models below 6.7B, which have few or no systematic outliers.

3. Set outlier columns to 4. More outlier columns means more values handled in FP16, but the paper shows that even at 13B scale, there are at most 7 outlier dimensions. The overhead of the FP16 path stays negligible.

The key takeaway from playing with this simulation: The error from standard INT8 grows linearly with outlier magnitude. The error from LLM.int8() stays approximately constant regardless of outlier magnitude, because the outliers never enter the quantization path. This is why LLM.int8() scales where other methods don't — the outliers get worse at scale, but LLM.int8() is invariant to outlier magnitude.
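
The reason is visible in the step size of absmax quantization: with a scale of max|x| / 127, the worst-case rounding error is half a step, so it grows in proportion to the largest value in the row. The sketch below sweeps the outlier magnitude for a single token row; the data and the helper names are hypothetical, and a single outlier is used for simplicity.

```python
import numpy as np

def absmax_roundtrip(row):
    """Quantize a row to INT8 with its absmax scale, then dequantize."""
    scale = np.abs(row).max() / 127.0
    return np.round(row / scale) * scale

rng = np.random.default_rng(1)
normal = rng.normal(size=15)            # regular features of one token

def normal_value_error(magnitude, decompose):
    """Mean absolute error on the regular features only."""
    if decompose:
        recon = absmax_roundtrip(normal)          # outlier never enters the INT8 path
    else:
        row = np.append(normal, magnitude)        # outlier shares the row's scale
        recon = absmax_roundtrip(row)[:-1]
    return np.abs(recon - normal).mean()

magnitudes = (5, 40, 70)
standard = [normal_value_error(m, decompose=False) for m in magnitudes]
decomposed = [normal_value_error(m, decompose=True) for m in magnitudes]

for m, s, d in zip(magnitudes, standard, decomposed):
    print(f"magnitude {m:>2}: step {m / 127:.3f}  standard {s:.4f}  decomposed {d:.4f}")
```

The decomposed error is the same for every magnitude because the outlier value is never used when computing the INT8 scale.
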
In the LLM.int8() decomposition, when you increase the outlier magnitude from 40 to 70, what happens to the quantization error of the REGULAR (non-outlier) features?

Chapter 9: Connections

LLM.int8() was released as a preprint in August 2022, presented at NeurIPS 2022, and quickly changed the landscape of LLM deployment. Let's place it in the broader context of model compression and efficient inference.

What this paper established

Three lasting contributions:
1. The first zero-degradation quantization at 175B scale. Prior work topped out at 350M parameters. LLM.int8() jumped 500x in model size with no loss.
2. The discovery of emergent outlier features. The phase transition at 6.7B — outliers appearing in all layers, all tokens, concentrated in a handful of dimensions — was a genuine surprise. It revealed something fundamental about how large transformers organize information.
3. The bitsandbytes library. The open-source implementation was integrated directly into Hugging Face Transformers, making INT8 inference available to millions of users with a one-line code change.

What came next

| Method | Year | Key idea | Bits | Relationship to LLM.int8() |
|---|---|---|---|---|
| GPTQ | 2022 | Layer-wise post-training quantization using Hessian info | 4-bit | Complementary: pushes to fewer bits with calibration data |
| QLoRA | 2023 | 4-bit NormalFloat + LoRA for finetuning | 4-bit | By Dettmers: extends bitsandbytes to training |
| AWQ | 2023 | Activation-aware weight quantization | 4-bit | Builds on the outlier insight: protects important weights |
| SqueezeLLM | 2023 | Non-uniform quantization with sparse outliers | 3-4-bit | Directly extends the outlier decomposition idea |
| FP8 | 2023+ | Native 8-bit float hardware support (H100+) | 8-bit float | May eventually supersede INT8 quantization |

Limitations acknowledged by the authors

INT8 only. The paper does not study FP8 data types. Since 2022, NVIDIA H100 and later GPUs added native FP8 support, which may provide better precision-performance tradeoffs than INT8 for the same bit width.

Inference only. LLM.int8() is designed for inference. The authors' initial experiments with INT8 training showed degradation for attention projections at scale (Appendix E of the paper). Training requires different techniques — this gap was partially filled by QLoRA in 2023.

Attention scores not quantized. The QK^T attention computation stays in FP16. For long-context models where attention memory dominates, this is a significant limitation.

Overhead at small scale. For models of 6.7B parameters and smaller, LLM.int8() is slower than FP16. But models of that size generally fit in GPU memory without quantization, so this is rarely a practical concern.

The deeper insight

Perhaps the most lasting impact of this paper isn't the quantization algorithm itself — it's the discovery of emergent features. The finding that large transformers spontaneously develop a handful of extreme-magnitude dimensions, that these dimensions are critical for attention, and that their emergence follows a phase transition — this tells us something fundamental about how neural networks organize information at scale. Subsequent work on model interpretability has built on this observation, studying what these outlier dimensions encode and why they emerge.

From the paper: "While up to 150k outliers exist per 2048 token sequence for a 13B model, these outlier features are highly systematic and only represent at most 7 unique feature dimensions. Insights from this analysis were critical to developing mixed-precision decomposition."

Using LLM.int8() today

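```python
# Load a model in 8-bit via bitsandbytes + Hugging Face Transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # Transformers-format checkpoint (gated, needs access)
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # <-- that's it
    device_map="auto",             # spread layers across the available GPUs
)
# Weights now take ~70 GB instead of ~140 GB in FP16 (1 byte vs. 2 bytes per parameter)
# The paper reports no measurable performance degradation at this scale
```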
What is the most lasting scientific contribution of the LLM.int8() paper, beyond the practical quantization algorithm?