LLM.int8() — Veanors

Chapter 0: The Memory Wall

You've trained a 175-billion-parameter language model. It took months of compute on thousands of GPUs. Now you want to use it — run inference, answer questions, generate text. One problem: the model's weights alone consume 350 GB of memory in FP16 (16-bit floating point). A single NVIDIA A100 GPU has 80 GB. You need at least 5 GPUs just to hold the model, before you've processed a single token.

This is the memory wall. For large language models at and beyond 6.7B parameters, the feed-forward and attention projection layers — the matrix multiplications — are responsible for 95% of all parameters and 65-85% of all computation. The bottleneck is not compute speed; it's how much model you can fit in memory.

What if you could represent each weight with 8 bits instead of 16? You'd cut memory in half. A 175B model would shrink from 350 GB to 175 GB — suddenly fitting on a single server with consumer GPUs. The arithmetic would be faster too: INT8 tensor cores can do roughly twice the throughput of FP16.

The catch: Prior to this paper, every attempt to quantize transformers larger than 350M parameters to 8-bit caused noticeable performance degradation. Methods that worked beautifully on BERT-sized models (350M params) simply broke when applied to models with billions of parameters. Nobody understood why.

Let's see the scale of the problem. Here's what each hardware setup can run in 16-bit vs. 8-bit precision:

Hardware	GPU Memory	Largest Model (16-bit)	Largest Model (8-bit)
8x A100 80GB	640 GB total	OPT-175B / BLOOM	OPT-175B / BLOOM
8x A100 40GB	320 GB total	OPT-66B	OPT-175B / BLOOM
8x RTX 3090	192 GB total	OPT-66B	OPT-175B / BLOOM
4x RTX 3090	96 GB total	OPT-30B	OPT-66B
Colab Pro	15 GB	GPT-J-6B	OPT-13B
Free Colab	12 GB	GPT-2 1.3B	T0/T5-11B

Look at that last row. With 8-bit quantization, a free Google Colab notebook, something any student can access, can run an 11B-parameter model. In 16-bit, you're limited to 1.3B. That's roughly an 8x increase in accessible model size.

[Interactive figure: Memory Footprint by Precision. Each bar shows the GPU memory required to hold the model weights at each precision.]

Where the memory goes

Let's do the arithmetic. A 175B-parameter model stores each parameter as a 16-bit float = 2 bytes. Total: 175 × 10⁹ × 2 bytes = 350 GB. But that's just the weights. During inference, you also need memory for:

Component	What it stores	Approximate size (175B)
Model weights	W_Q, W_K, W_V, W_O, W_up, W_down for each layer	350 GB (FP16) / 175 GB (INT8)
KV cache	Key and value tensors for all past tokens	Grows with sequence length
Activations	Intermediate hidden states during forward pass	~1-5 GB (depends on batch size)
CUDA overhead	Framework buffers, memory fragmentation	~2-5 GB

The weights dominate. If you can halve the weight memory, you can fit the model on half the GPUs — or fit a model twice as large on the same hardware.
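The arithmetic above is easy to script. The sketch below is a back-of-the-envelope helper, not a profiler: the function names are hypothetical, it counts weights only (no KV cache, activations, or framework overhead), and it uses decimal gigabytes (10⁹ bytes) to match the figures in this chapter.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, precision):
    """Weight-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def min_gpus(num_params, precision, gpu_memory_gb):
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)

for precision in ("fp16", "int8"):
    gb = weight_memory_gb(175e9, precision)
    gpus = min_gpus(175e9, precision, 80)
    print(f"175B @ {precision}: {gb:.0f} GB, at least {gpus} x 80 GB GPUs")
```

Real deployments need headroom beyond these minimums for the other components in the table above.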

Dettmers et al. solved this problem. Their method, LLM.int8(), quantizes transformer weights to 8-bit integers with zero performance degradation, even at 175B parameters. The key insight? They discovered that quantization fails at scale because of a handful of emergent outlier features (values up to 20x larger than everything else) that appear in specific hidden dimensions once models reach about 6.7B parameters. By isolating these outliers and handling them in 16-bit while quantizing everything else to 8-bit, they achieved lossless compression.

This wasn't just an engineering improvement. It was a scientific discovery about how transformers organize information at scale — and that discovery is what made the engineering possible.

Let's understand how, starting with the basics of quantization.

Why is the memory wall the primary bottleneck for LLM inference, rather than compute speed?

• Because the feed-forward and attention projection layers account for 95% of all parameters, and you must fit all weights in GPU memory before you can process even one token: a 175B model needs 350 GB in FP16, exceeding any single GPU
• Because GPUs are too slow to multiply large matrices
• Because the attention mechanism requires quadratic memory in sequence length

Chapter 1: Quantization 101

Before we can understand LLM.int8(), we need to understand what quantization means and why it's hard. The core idea is simple: represent numbers with fewer bits.

A 16-bit floating point number (FP16) can represent values from approximately -65,504 to +65,504 with varying precision. It uses 1 sign bit, 5 exponent bits, and 10 mantissa bits. This gives you fine-grained precision near zero and coarser precision for large values — exactly what you want for neural network weights, which tend to cluster near zero.

An 8-bit signed integer (INT8) can represent exactly 256 values: the integers from -128 to +127. That's it. No fractions, no exponents. Just 256 evenly spaced integers.

To feel the difference: FP16 can distinguish between 1.0000 and 1.0010. INT8 can only distinguish between 1 and 2. If a weight has value 1.3, INT8 must round it to either 1 or 2 — losing the fractional part. This rounding error is quantization error, and the entire challenge of model quantization is keeping it small enough that the model doesn't notice.
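You can poke at both number formats with nothing but the standard library. This is a small sketch, assuming Python 3.6+ where the `struct` module supports the half-precision format code `"e"`; the helper name `to_fp16` is made up for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# FP16 has a fine grid near 1.0: neighbouring values are 2**-10 apart.
print(to_fp16(1.0))
print(to_fp16(1.001))

# An unscaled integer format has a grid spacing of 1: 1.3 becomes 1.
q = round(1.3)
print(q, "quantization error:", abs(1.3 - q))
```

The second print shows 1.001 snapping to the closest FP16 grid point, while the integer rounding throws away the whole fractional part.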

The quantization problem in one sentence: How do you map a continuous range of floating-point values (potentially thousands of distinct values in a weight tensor) down to just 256 discrete integers, while losing as little information as possible?

Think of it like compressing a high-resolution photograph into a 256-color palette. You need a mapping function that assigns each original color to its closest palette entry, and a reverse mapping (dequantization) that converts back. The quality of the compressed image depends entirely on how cleverly you choose those 256 colors.

For neural networks, there are two main approaches:

Symmetric (Absmax)

Map the range [-max, +max] to [-127, +127]. Zero maps to zero. Simple and fast, but wastes half the range if values are all positive.

↓ vs

Asymmetric (Zeropoint)

Map the range [min, max] to [-127, +127]. Uses the full INT8 range. More precise for skewed distributions, but requires extra bookkeeping.

Let's build these from scratch.
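As a first taste, here is a minimal pure-Python sketch of both schemes on a ReLU-like (non-negative) list of values. The zeropoint version is one common affine formulation, chosen here for illustration; it is an assumption and not necessarily the exact formula used in the paper. All function names are hypothetical.

```python
def absmax_roundtrip(values):
    """Symmetric: map [-max|x|, +max|x|] onto [-127, 127], then map back."""
    scale = 127.0 / max(abs(v) for v in values)
    codes = [round(v * scale) for v in values]
    return codes, [c / scale for c in codes]

def zeropoint_roundtrip(values):
    """Asymmetric: affine map sending min(x) to -127 and max(x) to +127."""
    lo, hi = min(values), max(values)
    scale = 254.0 / (hi - lo)                 # 254 integer steps across [lo, hi]
    zero_point = round(-127 - lo * scale)     # integer offset added after scaling
    codes = [max(-127, min(127, round(v * scale) + zero_point)) for v in values]
    return codes, [(c - zero_point) / scale for c in codes]

def max_error(values, recovered):
    return max(abs(v - r) for v, r in zip(values, recovered))

relu_like = [0.0, 0.4, 1.1, 2.5, 3.7, 6.0]   # all non-negative
abs_codes, abs_rec = absmax_roundtrip(relu_like)
zp_codes, zp_rec = zeropoint_roundtrip(relu_like)
print("absmax   ", abs_codes, max_error(relu_like, abs_rec))
print("zeropoint", zp_codes, max_error(relu_like, zp_rec))
```

On this input the absmax codes never go below zero, so half of the integer range sits idle, while the zeropoint codes stretch from -127 to +127 and the round-trip error shrinks accordingly.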

Why does quantization matter for matrix multiplication?

A transformer's core operation is matrix multiplication: hidden states X (shape [sequence_length, hidden_dim]) multiplied by weight matrices W (shape [hidden_dim, output_dim]). In FP16, each element of X and W is a 16-bit float. The multiplication produces a 32-bit output that's then cast back to FP16.

If we quantize both X and W to INT8, we halve the memory for storing weights and can use INT8 tensor cores, which are roughly 2x faster than FP16 cores on modern GPUs. The INT8 multiplication produces INT32 outputs (because two 8-bit integers multiplied can be up to 127 × 127 = 16,129, needing more than 8 bits), which we dequantize back to FP16. The key question: does this round-trip — quantize, multiply, dequantize — preserve the original FP16 result?

FP16 inputs

X_f16 [s, h] and W_f16 [h, o] — original precision

↓ quantize

INT8 inputs

X_i8 [s, h] and W_i8 [h, o] — each value rounded to [-127, 127]

↓ multiply (INT8 × INT8 = INT32)

INT32 output

C_i32 [s, o] — accumulated dot products

↓ dequantize (divide by scaling factors)

FP16 output

C_f16 [s, o] ≈ X_f16 · W_f16

X_f16 W_f16 = C_f16 ≈ S_f16 · X_i8 W_i8

Where S_f16 is a scaling factor that converts the INT32 result back to FP16; it is the reciprocal of the product of the scales used to quantize X and W, which is why the diagram above describes dequantization as dividing by the scaling factors. The "approximately equals" is the crux: how close is the approximation? That depends on the quantization scheme. A good scheme minimizes the error between C_f16 (the true FP16 matmul) and the dequantized INT8 result. A bad scheme introduces enough error that the model's outputs become garbage.
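The whole round trip fits in a few lines. The sketch below is a toy, pure-Python version under stated assumptions: it uses one absmax scale per matrix (the simplest possible scheme), Python floats stand in for FP16, Python ints stand in for INT8/INT32, and the matrices and helper names are invented for the example.

```python
def quantize_tensor(M):
    """Tensor-wise absmax: one scale for the whole matrix."""
    scale = 127.0 / max(abs(v) for row in M for v in row)
    return [[round(v * scale) for v in row] for row in M], scale

def matmul(A, B):
    """Plain matmul; with integer inputs the accumulation stays exact."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[0.5, -1.2, 0.8], [0.3, 2.0, -0.1]]       # [s=2, h=3]
W = [[0.2, -0.4], [1.0, 0.6], [-0.7, 0.9]]     # [h=3, o=2]

X_i8, s_x = quantize_tensor(X)                 # quantize
W_i8, s_w = quantize_tensor(W)
C_i32 = matmul(X_i8, W_i8)                     # integer multiply-accumulate
S = 1.0 / (s_x * s_w)                          # dequantization factor
C_approx = [[S * c for c in row] for row in C_i32]

C_exact = matmul(X, W)
max_err = max(abs(a - e) for ra, re in zip(C_approx, C_exact) for a, e in zip(ra, re))
print("dequantized:", C_approx)
print("exact:      ", C_exact)
print("max error:  ", max_err)
```

With well-behaved inputs like these, the dequantized result lands very close to the exact product. The chapters that follow are about what happens when the inputs are not well-behaved.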

Why does absmax (symmetric) quantization waste precision for asymmetric distributions like ReLU outputs (which are always non-negative)?

• Because absmax is slower to compute than zeropoint
• Because absmax maps [-max, +max] to [-127, +127] symmetrically around zero: if all values are positive, the entire negative half of the INT8 range [-127, 0) goes unused, effectively quantizing with only 127 levels instead of 254
• Because ReLU outputs are always large in magnitude

Chapter 2: Absmax Quantization — Deriving It

Let's derive absmax quantization from first principles. We have a floating-point tensor X_f16 with shape [s, h] (sequence length by hidden dimension). We want to map every value into the integer range [-127, +127].

The strategy: find the largest absolute value in the tensor, and scale everything so that value maps to 127 (or -127). All other values are scaled proportionally and rounded to the nearest integer.

Step 1: Compute the scaling factor

Find the infinity norm of the tensor — the maximum absolute value across all elements:

s_x = 127 / ||X_f16||_∞ = 127 / max_ij(|X_f16^ij|)

This is the scaling factor. If the largest value in X is 5.0, then s_x = 127 / 5.0 = 25.4. Every value gets multiplied by 25.4, so 5.0 maps to 127, 2.5 maps to ~64, and 0.1 maps to ~3.

Step 2: Quantize

X_i8 = ⌊ s_x · X_f16 ⌉

Where ⌊·⌉ denotes rounding to the nearest integer. That's the entire quantization formula. Multiply by the scaling factor, round to integer, clamp to [-127, 127].

Step 3: Dequantize

To recover the approximate original value, divide by the scaling factor:

X_f16 ≈ X_i8 / s_x

Worked example

Suppose we have a small tensor X_f16 = [-0.8, 1.5, 0.3, -2.1, 0.7].

Step by step:
1. max absolute value = |−2.1| = 2.1
2. s_x = 127 / 2.1 = 60.48
3. Quantize each value:
  −0.8 × 60.48 = −48.38 → −48
  1.5 × 60.48 = 90.71 → 91
  0.3 × 60.48 = 18.14 → 18
  −2.1 × 60.48 = −127.0 → −127
  0.7 × 60.48 = 42.33 → 42
4. Dequantize: divide by 60.48
  −48 / 60.48 = −0.794 (original: −0.8, error: 0.006)
  91 / 60.48 = 1.505 (original: 1.5, error: 0.005)
  18 / 60.48 = 0.298 (original: 0.3, error: 0.002)
  −127 / 60.48 = −2.100 (original: −2.1, error: 0.000)
  42 / 60.48 = 0.694 (original: 0.7, error: 0.006)

Not bad! The worst error is 0.006. But now imagine what happens if one value in the tensor is an outlier with magnitude 60:

The outlier catastrophe: If X = [-0.8, 1.5, 0.3, -60.0, 0.7], then s_x = 127/60 = 2.117. Now:
  −0.8 × 2.117 = −1.69 → −2 (dequantized: −0.945, error: 0.145!)
  1.5 × 2.117 = 3.18 → 3 (dequantized: 1.417, error: 0.083)
  0.3 × 2.117 = 0.64 → 1 (dequantized: 0.472, error: 0.172!)
A single outlier crushes the precision of every other value. Most of the 254 quantization levels are wasted on the empty range between ±2 and ±60.

This is foreshadowing. Remember this: a single large outlier destroys quantization precision for an entire tensor. We'll see that this is exactly what happens in large transformers.
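To put a number on the damage, the sketch below reruns both worked examples and averages the round-trip error over the four ordinary values only. It is a pure-Python illustration with hypothetical helper names; Python floats stand in for FP16.

```python
def absmax_roundtrip(values):
    """Quantize with a single absmax scale, then dequantize."""
    scale = 127.0 / max(abs(v) for v in values)
    return [round(v * scale) / scale for v in values]

def mean_error_excluding(values, skip_index):
    """Mean absolute round-trip error, ignoring the entry at skip_index."""
    recovered = absmax_roundtrip(values)
    errors = [abs(v - r) for i, (v, r) in enumerate(zip(values, recovered))
              if i != skip_index]
    return sum(errors) / len(errors)

clean = [-0.8, 1.5, 0.3, -2.1, 0.7]
with_outlier = [-0.8, 1.5, 0.3, -60.0, 0.7]

e_clean = mean_error_excluding(clean, skip_index=3)
e_outlier = mean_error_excluding(with_outlier, skip_index=3)
print(f"mean error without outlier: {e_clean:.4f}")
print(f"mean error with outlier:    {e_outlier:.4f}")
print(f"ratio: {e_outlier / e_clean:.0f}x")
```

The same four values are being stored in both cases; only the entry at index 3 changed, yet their error grows by more than an order of magnitude.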

[Interactive figure: Absmax Quantization Precision. A random tensor of 32 values with a controllable outlier, showing how one outlier destroys quantization precision for everything else.]

python
import torch

def absmax_quantize(X_f16):
    """Absmax quantization: FP16 -> INT8"""
    scale = 127.0 / X_f16.abs().max()     # s_x = 127 / ||X||_inf
    X_i8 = (X_f16 * scale).round().clamp(-127, 127).to(torch.int8)
    return X_i8, scale

def absmax_dequantize(X_i8, scale):
    """Dequantize: INT8 -> FP16"""
    return X_i8.float() / scale

# Example: quantize a weight matrix
W = torch.randn(4096, 4096, dtype=torch.float16)
W_i8, scale = absmax_quantize(W)
W_recovered = absmax_dequantize(W_i8, scale)
print(f"Max error: {(W.float() - W_recovered).abs().max():.6f}")
# Expect a max error near half a quantization step, max|W| / (2 * 127): about 0.02 for a standard-normal W of this size

If a tensor has values in the range [-1.0, 1.0] except for one outlier at 50.0, what fraction of the 254 usable INT8 levels (from -127 to +127) are dedicated to representing the "normal" values in [-1, 1]?

• About 1/50th, or roughly 2-3%. The scale factor becomes 127/50 = 2.54, so values in [-1, 1] map to integers in [-3, +3]: only 7 of the 254 levels, leaving the vast majority wasted on the empty range between 3 and 127
• About 50%. Half the levels go to positive, half to negative
• About 100%. The outlier doesn't affect the other values

Chapter 3: Vector-wise Quantization

In the previous chapter, we used a single scaling factor for the entire tensor. One outlier anywhere ruins precision everywhere. The natural fix: use more scaling factors.

The key insight from Dettmers et al. is to view matrix multiplication as a sequence of independent inner products. Consider multiplying X ∈ R^s×h by W ∈ R^h×o. Each element of the output C[i, j] is the inner product of row i of X with column j of W. These inner products are independent — they don't share any computation.

Key insight: If each inner product is independent, we can use a different scaling factor for each row of X and each column of W. An outlier in row 5 of X only affects the scaling for row 5, not row 6. This is vector-wise quantization.

Three levels of quantization granularity

Method	Scaling constants	# Constants	Outlier isolation
Tensor-wise	One per tensor	1 for X, 1 for W	None — one outlier kills everything
Row-wise	One per row of X, one for all of W	s for X, 1 for W	Partially — isolates outliers to their row
Vector-wise	One per row of X, one per column of W	s for X, o for W	Best — each inner product has its own scale

The math

For vector-wise quantization, we assign a scaling constant c_xⁱ to each row i of X (computed as 127 / max|X[i, :]|) and a constant c_w^j to each column j of W (computed as 127 / max|W[:, j]|). The quantized matrix multiplication becomes:

C_f16 ≈ (c_x ⊗ c_w)⁻¹ · C_i32 = S · (X_i8 · W_i8)

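Where c_x ∈ R^s is the vector of row-wise scaling constants, c_w ∈ R^o is the vector of column-wise constants, and ⊗ denotes the outer product. The inverse is taken element by element (it is a reciprocal, not a matrix inverse), so the dequantization matrix S = (c_x ⊗ c_w)⁻¹ has shape [s, o] and holds one dequantization factor per output element: S[i, j] = 1 / (c_x[i] · c_w[j]). It is applied to C_i32 element-wise.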

Worked example

Suppose X is 2×3 and W is 3×2:

X_f16 = [[1.0, -0.5, 0.2], [0.3, 2.0, -0.1]]

Row scaling constants:
c_x[0] = 127 / max(1.0, 0.5, 0.2) = 127 / 1.0 = 127.0
c_x[1] = 127 / max(0.3, 2.0, 0.1) = 127 / 2.0 = 63.5

Quantize each row independently:
Row 0: [1.0 × 127, -0.5 × 127, 0.2 × 127] = [127, -64, 25]
Row 1: [0.3 × 63.5, 2.0 × 63.5, -0.1 × 63.5] = [19, 127, -6]

The key difference from tensor-wise: each row has its own scale factor. With a single tensor-wise scale of 127/max(all) = 127/2.0 = 63.5, row 0 would quantize to [64, -32, 13] and use only half of the available range. With its own scale of 127.0, it becomes [127, -64, 25] and uses the full range.

Similarly, W gets one scaling constant per column.
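The worked example is small enough to check by hand or in plain Python. The sketch below quantizes the same X once with per-row scales and once with a single tensor-wise scale. The helper names are hypothetical, and Python's built-in `round()` rounds halves to the nearest even integer, which is also how `torch.round` behaves.

```python
def quantize_rows(X):
    """Row-wise absmax: each row gets its own scale 127 / max|row|."""
    out = []
    for row in X:
        scale = 127.0 / max(abs(v) for v in row)
        out.append([round(v * scale) for v in row])
    return out

def quantize_tensor(X):
    """Tensor-wise absmax: one scale 127 / max|X| shared by every row."""
    scale = 127.0 / max(abs(v) for row in X for v in row)
    return [[round(v * scale) for v in row] for row in X]

X = [[1.0, -0.5, 0.2], [0.3, 2.0, -0.1]]
print("row-wise:   ", quantize_rows(X))
print("tensor-wise:", quantize_tensor(X))
```

Row 1 comes out the same either way, because it contains the global maximum. Row 0 is where the per-row scale pays off: its integers are roughly twice as large, so its values are resolved twice as finely.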

python
import torch

def vector_wise_quantize(X, W):
    """Vector-wise quantization for matrix multiplication."""
    # Row-wise scaling for X: one constant per row
    cx = 127.0 / X.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)  # shape [s, 1]
    X_i8 = (X * cx).round().clamp(-127, 127).to(torch.int8)

    # Column-wise scaling for W: one constant per column
    cw = 127.0 / W.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)  # shape [1, o]
    W_i8 = (W * cw).round().clamp(-127, 127).to(torch.int8)

    # INT8 matmul (accumulate in INT32)
    C_i32 = X_i8.int() @ W_i8.int()  # shape [s, o]

    # Dequantize: outer product of inverse scaling constants
    S = 1.0 / (cx @ cw)  # shape [s, o] — one factor per output
    C_f16 = (C_i32.float() * S).half()

    return C_f16

The paper reports that 8-bit quantization with vector-wise scaling largely retains performance up to about 2.7B parameters (in the table below, the zeropoint vector-wise row tracks the baseline most closely). Beyond that, even vector-wise quantization starts to degrade because of the emergent outlier features we'll see in Chapter 4.

Method	125M	1.3B	2.7B	6.7B	13B
FP32 baseline	25.65	15.91	14.43	13.30	12.45
INT8 absmax tensor-wise	87.76	16.55	15.11	14.59	19.08
INT8 absmax vector-wise	35.84	16.82	14.98	14.13	16.48
INT8 zeropoint vector-wise	25.72	15.94	14.36	13.38	13.47
LLM.int8()	25.83	15.93	14.44	13.24	12.45

Study this table carefully. Look at the 13B column. Tensor-wise absmax: 19.08 — worse than the 6.7B model. Vector-wise absmax: 16.48 — still terrible. Even zeropoint: 13.47, slightly degraded. Only LLM.int8() achieves 12.45, matching the FP32 baseline exactly. Something catastrophic happens at the 6.7B+ scale that only LLM.int8() can handle.

In vector-wise quantization, how many scaling constants are computed for the hidden state matrix X of shape [s, h]?

• One per tensor (1 constant total)
• One per row (s constants total): each row of X gets its own absmax scaling factor, because each row participates in a different set of inner products during matrix multiplication
• One per element (s × h constants total)

Chapter 4: Emergent Outlier Features

This is the paper's most important contribution: the discovery and characterization of emergent outlier features in large transformer hidden states. This section is fascinating because it reveals something genuinely surprising about how large language models work internally.

Dettmers et al. examined transformer hidden states X ∈ R^s×h across models from 125M to 13B parameters. They tracked individual feature dimensions h_i (columns of the hidden state matrix) across all layers, looking for values with magnitude ≥ 6.0.

What they found

At small scales (117M-1.5B parameters), outliers are rare and sporadic: 1-2 dimensions with large values, appearing in 25-41% of layers and affecting 6-35% of sequence positions. Quantization handles these fine because they're not systematic enough to dominate the scaling factor.

At medium scales (2.7B-6B), outliers become more common: 5-6 dimensions, appearing in 52-62% of layers. Quantization starts to struggle.

The phase transition at 6.7B parameters: Something dramatic happens. At around 6.7B parameters (measured in the fairseq model family), outliers suddenly appear in 100% of all transformer layers and 75% of all sequence positions. The median outlier magnitude jumps from about 17 to about 40. And these outliers are concentrated in just 6 feature dimensions out of thousands.

For a 6.7B model with sequence length 2048, this means approximately 150,000 outlier values per sequence across the whole model, all concentrated in just 6 out of ~4096 hidden dimensions.

Model	Params	PPL	# Outlier dims	% Layers	% Seq dims	Outlier magnitude (quartiles)
GPT-2	117M	33.5	1	25%	6%	(-8, -7, -6)
GPT-2	345M	26.0	2	29%	18%	(6, 7, 8)
GPT-2	1.5B	21.0	2	41%	35%	(-11, -9, -7)
FSEQ	2.7B	14.4	5	52%	18%	(-25, -16, -9)
GPT-J	6.0B	13.8	6	62%	28%	(-21, -17, -14)
FSEQ	6.7B	13.3	6	100%	75%	(-44, -40, -35)
FSEQ	13B	12.5	7	100%	73%	(-63, -58, -45)

Look at the magnitude column. At 6.7B, the outlier values reach -44. At 13B, they reach -63. Normal feature values in a transformer hidden state are typically in the range [-3.5, +3.5]. These outliers are 10-20x larger than everything else.
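The two key columns of that table are easy to measure on any hidden state matrix. The sketch below is an illustration on synthetic data, not the paper's measurement code: the function name, the matrix size, and the two planted outlier dimensions are all assumptions made for the example. Only the magnitude threshold of 6.0 comes from the text.

```python
import random

def outlier_stats(X, threshold=6.0):
    """Return (outlier dimension indices, fraction of rows containing an outlier)."""
    num_dims = len(X[0])
    dims = [j for j in range(num_dims)
            if any(abs(row[j]) >= threshold for row in X)]
    rows_hit = sum(1 for row in X if any(abs(v) >= threshold for v in row))
    return dims, rows_hit / len(X)

random.seed(0)
seq_len, hidden = 64, 32
# Ordinary features stay inside the typical range [-3.5, 3.5] ...
X = [[random.uniform(-3.5, 3.5) for _ in range(hidden)] for _ in range(seq_len)]
# ... plus two planted outlier dimensions that fire on 3 of every 4 tokens.
for i, row in enumerate(X):
    if i % 4 != 0:
        row[5] = -40.0
        row[17] = -44.0

dims, frac = outlier_stats(X)
print("outlier dimensions:", dims)
print(f"sequence positions affected: {frac:.0%}")
```

The pattern to notice is the shape of the result: very few dimensions, but a large share of the sequence positions.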

Why do outliers matter for the model?

Dettmers et al. tested what happens when you remove outlier features by setting them to zero before they enter the attention layers. The results are dramatic:

Removing just 6-7 outlier dimensions (out of ~4096):
• Mean top-1 softmax probability drops from ~40% to ~20%
• Validation perplexity increases by 600-1000%

Removing 7 random (non-outlier) dimensions instead:
• Top-1 probability decreases by only 0.02-0.3%
• Perplexity increases by only ~0.1%

These features make up only about 0.1% of all input dimensions, but they carry an outsized fraction of the model's "decision-making" information. They dominate the attention softmax, steering which tokens attend to which. The model has learned to encode critical information in just a handful of dimensions with extreme magnitudes.

Why does this happen?

The paper observes that emergence correlates more closely with perplexity than raw model size — a better-trained smaller model could potentially trigger the phase shift. The outliers appear to be a strategy the model develops to create sharp attention patterns. When a dimension has magnitude 60 and everything else is < 3, the softmax becomes extremely peaked on the tokens with the large outlier values. This is the mechanism the model uses to "pay attention" to specific positions.
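A toy softmax makes the "peaked" claim concrete. This is only an illustration of how softmax reacts to one very large input. It is not the paper's experiment, the logit values are invented, and real attention logits are scaled dot products rather than raw feature values.

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

ordinary = [1.0, 2.0, 0.5, 1.5]              # logits of ordinary size
with_spike = [1.0, 2.0, 0.5, 1.5 + 20.0]     # one logit boosted by a large feature

p_ordinary = softmax(ordinary)
p_spike = softmax(with_spike)
print("largest probability, ordinary logits:", max(p_ordinary))
print("largest probability, one large logit:", max(p_spike))
```

With ordinary logits the probability mass is spread over several positions. One large logit pulls essentially all of it onto a single position.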

The outliers are also highly asymmetric — mostly one-sided (either all positive or all negative across the sequence dimension). This is important: it explains why zeropoint (asymmetric) quantization outperforms absmax (symmetric) quantization for models at the 6.7B+ scale.

[Interactive figure: Outlier Emergence Phase Transition. Shows how outlier features progressively take over transformer layers as model scale increases.]

What makes the outlier emergence at 6.7B parameters a "phase transition" rather than a gradual change?

• The percentage of transformer layers containing outliers jumps from ~62% to 100%, and the percentage of affected sequence dimensions jumps from ~28% to ~75%, in a single scale step: outliers go from appearing in some layers sometimes to appearing in ALL layers most of the time
• The model suddenly gets worse at language modeling
• The outliers change from positive to negative values

Chapter 5: Why Outliers Break INT8

Now we understand the two pieces of the puzzle: quantization schemes (Chapters 2-3) and emergent outliers (Chapter 4). Let's put them together and see exactly why standard quantization fails at scale.

The geometry of the problem

Remember that vector-wise quantization assigns one scaling constant per row of the hidden state X. A row of X is one token's hidden state — all h feature dimensions for a single sequence position. Outliers occur in specific feature columns — the same handful of hidden dimensions across all tokens.

The dimensional mismatch: Vector-wise quantization normalizes by row. Outliers live in columns. A row that contains an outlier of magnitude 60 alongside normal values of magnitude 2 will have its scaling factor dominated by 60, crushing the precision of all the normal values in that row. And since outliers occur in 75% of all sequence positions after the phase transition, 75% of all rows are contaminated.

Let's trace through a concrete example. Consider a hidden state with h=6 dimensions, where dimension 3 is an outlier dimension:

Example row from X at 6.7B scale:
X[token_42, :] = [0.5, -1.2, 0.8, -44.0, 0.3, -0.7]

Row-wise absmax:
s = 127 / 44.0 = 2.886
Quantized: [1, -3, 2, -127, 1, -2]

Dequantized:
[0.346, -1.039, 0.693, -44.0, 0.346, -0.693]

Errors:
dim 0: |0.5 - 0.346| = 0.154 (31% relative error!)
dim 1: |1.2 - 1.039| = 0.161 (13% error)
dim 3: |44.0 - 44.0| = 0.000 (perfect; it's at the boundary)

The outlier is quantized perfectly, but the normal values lose massive precision. Four of the five normal dimensions end up with roughly 13-31% relative error (only dim 5, at about 1%, happens to land near a quantization level), so the downstream computation is severely corrupted.

This is why the perplexity table shows degradation at 6.7B+. Every row that contains an outlier (75% of rows) has its normal-valued features crushed. These normal values — 99.9% of the tensor — carry the bulk of the semantic information. Destroying their precision destroys the model's ability to distinguish between similar tokens.

Why zeropoint helps but isn't enough

Recall that outlier features are almost always one-sided (all negative or all positive). Absmax quantization maps [-max, +max] to [-127, +127], wasting half the range when values are one-sided. Zeropoint quantization instead maps the asymmetric range [min, max] to [-127, +127], using all 254 levels.

This explains why zeropoint outperforms absmax in the perplexity table at 6.7B (13.38 vs 14.13 for vector-wise). But by 13B, even zeropoint fails (13.47 vs 12.45 baseline) — the outlier magnitudes have grown so large (-63) that no amount of clever scaling within a single row can preserve precision for the normal values alongside them.

The fundamental impossibility

The problem is mathematically inescapable as long as we try to represent both outliers and normal values with the same 8-bit integers in the same row. We have 254 quantization levels. The range of values in a contaminated row is [-63, +3.5] or so. That's a span of 66.5. Each quantization step covers 66.5/254 = 0.26. Normal values that differ by 0.1 or less will usually quantize to the same integer. Information is permanently lost.

A concrete calculation: Consider a row with values [0.15, -0.23, 0.08, -58.0, -0.31, 0.19].

With absmax: scale = 127/58 = 2.19
• 0.15 × 2.19 = 0.33 → rounds to 0
• -0.23 × 2.19 = -0.50 → rounds to -1
• 0.08 × 2.19 = 0.18 → rounds to 0

Values 0.15 and 0.08 are identical after quantization — both are 0. The distinction between them is erased. Across an entire hidden state with 4096 dimensions and 75% of rows contaminated, this information loss is catastrophic.
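You can watch the collisions happen, and watch them disappear once the outlier is taken out of the row. The sketch below is plain Python with a hypothetical helper name; the cutoff of 6.0 used to separate the outlier is the threshold introduced in Chapter 4.

```python
def absmax_codes(values):
    """Integer codes under a single absmax scale for the whole row."""
    scale = 127.0 / max(abs(v) for v in values)
    return [round(v * scale) for v in values]

row = [0.15, -0.23, 0.08, -58.0, -0.31, 0.19]
normal_only = [v for v in row if abs(v) < 6.0]   # same row, outlier removed

codes_full = absmax_codes(row)
codes_normal = absmax_codes(normal_only)
print("with outlier:   ", codes_full)
print("without outlier:", codes_normal)
print("distinct codes for the 5 normal values:",
      len(set(codes_full) - {-127}), "vs", len(set(codes_normal)))
```

With the outlier in the row, the five normal values share just two integer codes. Without it, each one gets its own code.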

Why column-wise quantization isn't the answer

You might wonder: why not quantize by column instead of by row? Then each outlier column of X would get its own scaling constant. The problem is that the inner product sums over the hidden dimension h, and a scaling constant can only be undone after the INT32 accumulation if it is the same for every term in that sum. One constant per row of X and one per column of W satisfy this; one constant per column of X does not, because it would differ for every term being summed. Vector-wise quantization already uses the finest combination that factors out cleanly (rows of X, columns of W), but the outliers are in the columns of X, the axis that doesn't get its own scaling constant.
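Writing out a single output element shows where the scaling constants end up. This is a short derivation in the chapter's notation, with one addition that is not in the original text: d_k, a hypothetical per-column scaling constant for X.

```latex
\begin{align*}
\text{rows of } X,\ \text{columns of } W:\quad
C_{ij} &= \sum_{k=1}^{h} X_{ik} W_{kj}
\approx \sum_{k=1}^{h} \frac{X^{i8}_{ik}}{c_x^{i}} \cdot \frac{W^{i8}_{kj}}{c_w^{j}}
= \frac{1}{c_x^{i}\, c_w^{j}} \sum_{k=1}^{h} X^{i8}_{ik} W^{i8}_{kj} \\
\text{columns of } X:\quad
C_{ij} &\approx \sum_{k=1}^{h} \frac{X^{i8}_{ik}}{d_{k}} \cdot \frac{W^{i8}_{kj}}{c_w^{j}}
= \frac{1}{c_w^{j}} \sum_{k=1}^{h} \frac{1}{d_{k}}\, X^{i8}_{ik} W^{i8}_{kj}
\end{align*}
```

In the first line, the integer sum is exactly what an INT8 matmul produces, and one multiplication per output undoes the scaling. In the second line, the factor 1/d_k is stuck inside the sum, so the plain integer accumulation no longer gives the right answer.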

The solution? Don't try. Separate the outliers from the normal values and handle them differently. This is exactly what mixed-precision decomposition does.

Why can't vector-wise (row-wise for X) quantization handle outlier features, even though it uses a separate scaling constant per row?

• Because vector-wise quantization is slower than tensor-wise
• Because outlier features only appear in some rows
• Because outliers live in specific columns (feature dimensions) but row-wise quantization normalizes along rows: each contaminated row has its scale dominated by the outlier, crushing precision for the 99.9% of normal values sharing that row

Chapter 6: Mixed-Precision Decomposition

The insight is elegant in its simplicity: if outlier features live in a few columns and destroy quantization for everything they touch, just remove them before quantizing. Handle the outlier columns in full 16-bit precision. Quantize everything else to INT8. Then combine the results.

The algorithm step by step

Given input hidden states X_f16 ∈ R^s×h and weights W_f16 ∈ R^h×o:

1. Detect outlier columns

Scan X_f16 for columns (feature dimensions h_i) where any value has magnitude ≥ α = 6.0. Collect these indices into the outlier set O.

↓

2. Decompose X and W

Extract the outlier columns from X → X_outlier ∈ R^s×|O|. Extract the corresponding rows from W → W_outlier ∈ R^|O|×o. The remaining columns/rows form X_regular and W_regular.

↓

3. Multiply outliers in FP16

C_outlier = X_outlier · W_outlier (standard FP16 matmul, no quantization)

↓

4. Quantize and multiply regular values in INT8

Apply vector-wise absmax quantization to X_regular and W_regular. Perform INT8 matmul. Dequantize the result.

↓

5. Combine

C_f16 = C_outlier + C_regular (add in FP16)

The formula

In Einstein notation where indices are superscripts and h indexes the hidden dimension:

C_f16 ≈ ∑_{h ∈ O} X_f16^h W_f16^h + S_f16 · ∑_{h ∉ O} X_i8^h W_i8^h

The first sum handles outlier dimensions in FP16. The second sum handles everything else in INT8 with vector-wise quantization (S_f16 is the dequantization scaling matrix from the outer product of row/column constants).

Why the memory cost is negligible

The critical fact: for transformers up to 13B parameters, |O| ≤ 7. Out of a hidden dimension of ~5120 at the 13B scale, only 7 dimensions are outliers. That's 7/5120 = 0.14% of the dimensions handled in FP16. The remaining 99.86% are in INT8. The extra memory for storing the FP16 outlier columns is negligible — about 0.1% additional memory on top of the INT8 baseline.

python
import torch

def llm_int8_matmul(X_f16, W_f16, threshold=6.0):
    """LLM.int8() mixed-precision matrix multiplication."""
    # Step 1: Find outlier columns in X
    outlier_mask = X_f16.abs().max(dim=0).values >= threshold  # [h]; >= matches the α rule in the text
    outlier_cols = outlier_mask.nonzero().squeeze(-1)         # indices
    regular_cols = (~outlier_mask).nonzero().squeeze(-1)      # indices

    # Step 2: Decompose
    X_outlier = X_f16[:, outlier_cols]          # [s, |O|]
    W_outlier = W_f16[outlier_cols, :]          # [|O|, o]
    X_regular = X_f16[:, regular_cols]          # [s, h-|O|]
    W_regular = W_f16[regular_cols, :]          # [h-|O|, o]

    # Step 3: Outliers in FP16
    C_outlier = X_outlier @ W_outlier           # [s, o] in FP16

    # Step 4: Regular values — vector-wise INT8 quantization
    # Row-wise scaling for X_regular
    cx = 127.0 / X_regular.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    X_i8 = (X_regular * cx).round().clamp(-127, 127).to(torch.int8)

    # Column-wise scaling for W_regular
    cw = 127.0 / W_regular.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)
    W_i8 = (W_regular * cw).round().clamp(-127, 127).to(torch.int8)

    # INT8 matmul + dequantize
    C_i32 = X_i8.int() @ W_i8.int()
    S = 1.0 / (cx @ cw)                       # outer product dequant
    C_regular = (C_i32.float() * S).half()      # [s, o] back to FP16

    # Step 5: Combine
    return C_outlier + C_regular

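# Usage: a reference sketch, run on CPU. As of recent PyTorch releases, `@` on
# int32 CUDA tensors raises an error; real kernels use INT8 tensor cores instead.
X = torch.randn(64, 4096, dtype=torch.float16)
X[:, 7] = 40.0                              # plant one outlier feature dimension
W = torch.randn(4096, 4096, dtype=torch.float16)
output = llm_int8_matmul(X, W)              # [64, 4096] in FP16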

The α = 6.0 threshold: The paper finds that treating any feature with magnitude ≥ 6.0 as an outlier is sufficient to eliminate perplexity degradation. This threshold was determined empirically: below 6.0, too many non-critical features are separated (unnecessary overhead); above 6.0, some critical outliers get quantized to INT8 (precision loss). The value 6.0 sits at the sweet spot — roughly 2x the typical feature range of [-3.5, 3.5].
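To see what the decomposition buys, here is a dependency-free sketch that runs the same matmul three ways: exactly, with vector-wise INT8 only, and with the outlier column split off first. It is a toy under stated assumptions: Python floats stand in for FP16, Python ints for INT8/INT32, the sizes and the planted outlier are invented, and the helper names are hypothetical. It also assumes at least one non-outlier column exists.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def vectorwise_int8_matmul(X, W):
    """Row-wise absmax for X, column-wise absmax for W, integer accumulation."""
    cx = [127.0 / max(max(abs(v) for v in row), 1e-8) for row in X]
    cw = [127.0 / max(max(abs(v) for v in col), 1e-8) for col in zip(*W)]
    X_i8 = [[round(v * cx[i]) for v in row] for i, row in enumerate(X)]
    W_i8 = [[round(v * cw[j]) for j, v in enumerate(row)] for row in W]
    C_i32 = matmul(X_i8, W_i8)
    return [[c / (cx[i] * cw[j]) for j, c in enumerate(row)]
            for i, row in enumerate(C_i32)]

def mixed_precision_matmul(X, W, threshold=6.0):
    """Outlier columns of X (and matching rows of W) bypass quantization."""
    h = len(W)
    outliers = [k for k in range(h) if any(abs(row[k]) >= threshold for row in X)]
    regular = [k for k in range(h) if k not in outliers]
    X_o = [[row[k] for k in outliers] for row in X]
    W_o = [W[k] for k in outliers]
    X_r = [[row[k] for k in regular] for row in X]
    W_r = [W[k] for k in regular]
    # Full-precision path; all zeros when no outlier column was found.
    C_o = matmul(X_o, W_o) if outliers else [[0.0] * len(W[0]) for _ in X]
    C_r = vectorwise_int8_matmul(X_r, W_r)      # quantized path
    return [[a + b for a, b in zip(ro, rr)] for ro, rr in zip(C_o, C_r)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

random.seed(0)
s, h, o = 8, 32, 8
X = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(s)]
W = [[random.uniform(-1, 1) for _ in range(o)] for _ in range(h)]
for row in X:
    row[3] = -40.0                               # plant one outlier feature dimension

exact = matmul(X, W)
err_plain = max_abs_diff(vectorwise_int8_matmul(X, W), exact)
err_mixed = max_abs_diff(mixed_precision_matmul(X, W), exact)
print(f"vector-wise only: {err_plain:.4f}   with decomposition: {err_mixed:.4f}")
```

Only one of the 32 columns takes the full-precision path, yet removing it from the quantized part is what lets the other 31 columns use the whole integer range.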

In the mixed-precision decomposition, what determines whether a feature dimension is treated as an "outlier" column?

• If ANY value in that column of X has absolute magnitude ≥ 6.0, the entire column is extracted and multiplied in FP16, because even one large value in a column indicates it's a systematic outlier dimension that appears across many tokens
• If the mean of the column exceeds 6.0
• If the column has the highest variance in the tensor

Chapter 7: The Full LLM.int8() Algorithm

Let's put it all together. LLM.int8() is the combination of two techniques: vector-wise absmax quantization and mixed-precision decomposition. Together, they form a complete procedure that converts a 16-bit transformer checkpoint to 8-bit at inference time with zero performance degradation.

The complete data flow

For every linear layer in the transformer (feed-forward up-projection, down-projection, attention Q/K/V/O projections), the following happens at inference time:

Input: X_f16 [s, h] and W_f16 [h, o]

Hidden states from the previous layer (FP16) and the weight matrix (stored in FP16 in the original checkpoint; the weights can be converted to INT8 ahead of time).

↓

Outlier detection

Find O = {i : max_s|X[s, i]| ≥ 6.0}. Typically |O| ≤ 7.

↓

Decompose

Split X into X_O [s, |O|] and X_R [s, h−|O|]. Split W into W_O [|O|, o] and W_R [h−|O|, o].

↓ parallel

FP16 path: C_O = X_O W_O

0.1% of dimensions. Full precision matmul. No quantization error.

INT8 path: quantize X_R (row-wise), W_R (col-wise), matmul, dequantize

99.9% of dimensions. Vector-wise absmax quantization. Minimal error since outliers are removed.

↓

Output: C_f16 = C_O + dequant(C_R)

Accumulate in FP16. Pass to next layer.

What layers get quantized?

LLM.int8() targets the matrix multiplications in:

Layer	Operation	Quantized?
FFN up-projection	X · W_up	Yes (INT8 + mixed-precision)
FFN down-projection	X · W_down	Yes (INT8 + mixed-precision)
Attention Q/K/V projections	X · W_Q, X · W_K, X · W_V	Yes (INT8 + mixed-precision)
Attention output projection	X · W_O	Yes (INT8 + mixed-precision)
Attention scores (Q · K^T)	Softmax(QK^T/√d)	No — no parameters, pure activation
Embeddings	Lookup table	No — not a matmul
LayerNorm	Normalization	No — tiny parameter count

This covers 95% of all model parameters and 65-85% of all computation. The remaining operations (embeddings, normalization, attention scores) stay in FP16.

Memory savings

Since 99.9% of values are stored in INT8 (1 byte) instead of FP16 (2 bytes), the memory savings are approximately 2x for the quantized layers. For BLOOM-176B, the total memory reduction is 1.96x — from ~352 GB to ~180 GB. The slight shortfall from a perfect 2x comes from the FP16 outlier columns, the scaling constants, and the non-quantized layers (embeddings, norms).
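As a back-of-the-envelope check on these numbers, the sketch below assumes every parameter is stored as either 1 byte (INT8) or 2 bytes (FP16) and ignores the scaling constants. The helper names are hypothetical. Under that simplification, a 1.96x reduction implies that roughly 98% of the parameters ended up in INT8.

```python
def footprint_gb(n_params, int8_fraction):
    """Weights-only footprint when a fraction of parameters is INT8 and the rest is FP16."""
    bytes_per_param = int8_fraction * 1 + (1 - int8_fraction) * 2
    return n_params * bytes_per_param / 1e9

def implied_int8_fraction(reduction):
    """Invert reduction = 2 / (2 - f) to recover the INT8 fraction f."""
    return 2 - 2 / reduction

n_params = 176e9                                   # BLOOM-176B
f = implied_int8_fraction(1.96)
print(f"FP16 footprint: {footprint_gb(n_params, 0.0):.0f} GB")
print(f"implied INT8 fraction: {f:.3f}")
print(f"mixed footprint: {footprint_gb(n_params, f):.0f} GB")
```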

Runtime performance

The quantization and decomposition overhead matters for runtime speed. For models up to 6.7B, LLM.int8() is actually slower than FP16 due to the overhead of outlier detection and two separate matmuls. From 13B upward, INT8 tensor cores provide enough speedup to overcome the overhead. The table reports speed relative to the FP16 baseline (higher is faster):

Model size	FP16 baseline	Vector-wise INT8 (no decomp)	LLM.int8()
2.7B	1.00x	0.94x	0.64x (slower)
6.7B	1.00x	1.18x	0.86x
13B	1.00x	1.59x	1.22x (faster)
175B	1.00x	2.00x	1.81x (faster)

For BLOOM-176B end-to-end inference, LLM.int8() on 3x A100 80GB is comparable to FP16 on 8x A100 80GB — same speed, but using fewer than half the GPUs.

Why doesn't LLM.int8() quantize the attention score computation (softmax(QK^T / sqrt(d)))?

Because the attention function is too complex for INT8
Because the attention score computation has no weight parameters. LLM.int8() targets memory reduction, and since attention scores are computed from activations only (Q and K), quantizing them saves no memory; the Q, K, V projection weights ARE quantized
Because attention scores are already in INT8

Chapter 8: Quantization Explorer

Now let's see the full LLM.int8() algorithm in action. This interactive simulation lets you experience the mixed-precision decomposition on a real(istic) hidden state matrix. You can control the outlier magnitude, the number of outlier dimensions, and see how the quantization error changes with and without mixed-precision decomposition.

[Interactive simulation: LLM.int8() Mixed-Precision Decomposition. A simulated hidden state matrix X [8 tokens, 16 dims] with outlier columns highlighted, comparing standard INT8 quantization against LLM.int8() decomposition. Controls: outlier magnitude (shown at 40), number of outlier columns (shown at 2), and a mixed-precision toggle (ON keeps outliers in FP16).]

Notice how in standard INT8 mode, increasing the outlier magnitude causes the error for normal values to spike dramatically. Switch to LLM.int8() mode and the normal-value error drops to near-zero because outliers are handled separately in FP16.

What happens inside the simulation

The simulation does exactly what the paper's algorithm does:

Standard INT8: All values are quantized with a single row-wise scaling factor. The outlier dominates the scale, crushing precision for normal values. Try setting the outlier magnitude to 70 and observe the red error bars — almost every token row has substantial error.

LLM.int8(): Outlier columns are extracted and multiplied in FP16 (zero error). The remaining values are quantized with their own row-wise scaling factor — now undistorted by outliers. The two results are summed. Total error is orders of magnitude lower. The dashed red lines show where the standard INT8 error would be — the improvement is dramatic.

Experiments to try

1. Increase outlier magnitude from 5 to 70. In standard mode, watch the error grow linearly. In LLM.int8() mode, the error for normal values stays flat — because the outlier magnitude doesn't affect the INT8 quantization path at all.

2. Set outlier columns to 0. Both modes become identical — there's nothing to decompose. This is why LLM.int8() has no benefit for models below 6.7B (few or no outliers).

3. Set outlier columns to 4. More outlier columns means more values handled in FP16, but the paper shows that even at 13B scale, there are at most 7 outlier dimensions. The overhead of the FP16 path stays negligible.

The key takeaway from playing with this simulation: The error from standard INT8 grows linearly with outlier magnitude. The error from LLM.int8() stays approximately constant regardless of outlier magnitude, because the outliers never enter the quantization path. This is why LLM.int8() scales where other methods don't — the outliers get worse at scale, but LLM.int8() is invariant to outlier magnitude.
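The same experiment can be approximated offline. The sketch below is an assumption-laden stand-in for the interactive widget, not its actual code: it builds an 8 x 16 matrix of standard-normal values, overwrites the first few columns with a fixed outlier magnitude (designated by hand rather than detected with the 6.0 threshold), and measures the round-trip quantization error on the regular columns only. The function names are hypothetical.

```python
import numpy as np

def rowwise_int8_roundtrip(x):
    """Quantize each row with its own absmax scale, then dequantize back to float."""
    scale = np.max(np.abs(x), axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)       # guard against all-zero rows
    return np.clip(np.round(x / scale), -127, 127) * scale

def regular_feature_error(magnitude, n_outlier_cols=2, seed=0):
    """Mean abs error on non-outlier columns: (standard INT8, with decomposition)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(8, 16))
    outlier_cols = np.arange(n_outlier_cols)
    X[:, outlier_cols] = magnitude
    regular = np.ones(16, dtype=bool)
    regular[outlier_cols] = False
    # Standard INT8: outliers share the row and therefore set the scale.
    err_std = np.mean(np.abs(rowwise_int8_roundtrip(X)[:, regular] - X[:, regular]))
    # Decomposition: outlier columns are removed before quantizing.
    err_mix = np.mean(np.abs(rowwise_int8_roundtrip(X[:, regular]) - X[:, regular]))
    return err_std, err_mix

for m in (5, 20, 40, 70):
    err_std, err_mix = regular_feature_error(m)
    print(f"magnitude {m:>2}: standard INT8 {err_std:.4f} | decomposition {err_mix:.4f}")
```

Because the seed is fixed, the regular columns are identical across runs, so the decomposition error does not change at all as the outlier magnitude grows.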

In the LLM.int8() decomposition, when you increase the outlier magnitude from 40 to 70, what happens to the quantization error of the REGULAR (non-outlier) features?

It stays approximately the same: the outlier columns are handled entirely in FP16, so their magnitude has no effect on the INT8 quantization of the remaining columns, which have their own independent scaling constants
It increases proportionally to the outlier magnitude
It decreases because larger outliers are easier to detect

Chapter 9: Connections

LLM.int8() appeared on arXiv in August 2022, was presented at NeurIPS 2022, and immediately changed the landscape of LLM deployment. Let's place it in the broader context of model compression and efficient inference.

What this paper established

Three lasting contributions:
1. The first zero-degradation quantization at 175B scale. Prior work topped out at 350M parameters. LLM.int8() jumped 500x in model size with no loss.
2. The discovery of emergent outlier features. The phase transition at 6.7B — outliers appearing in all layers, all tokens, concentrated in a handful of dimensions — was a genuine surprise. It revealed something fundamental about how large transformers organize information.
3. The bitsandbytes library. The open-source implementation was integrated directly into Hugging Face Transformers, making INT8 inference available to millions of users with a one-line code change.

What came next

Method	Year	Key idea	Bits	Relationship to LLM.int8()
GPTQ	2022	Layer-wise post-training quantization using Hessian info	4-bit	Complementary — pushes to fewer bits with calibration data
QLoRA	2023	4-bit NormalFloat + LoRA for finetuning	4-bit	By Dettmers — extends bitsandbytes to training
AWQ	2023	Activation-aware weight quantization	4-bit	Builds on the outlier insight — protects important weights
SqueezeLLM	2023	Non-uniform quantization with sparse outliers	3-4-bit	Directly extends the outlier decomposition idea
FP8	2023+	Native 8-bit float hardware support (H100+)	8-bit float	May eventually supersede INT8 quantization

Limitations acknowledged by the authors

INT8 only. The paper does not study FP8 data types. Since 2022, NVIDIA H100 and later GPUs added native FP8 support, which may provide better precision-performance tradeoffs than INT8 for the same bit width.

Inference only. LLM.int8() is designed for inference. The authors' initial experiments with INT8 training showed degradation for attention projections at scale (Appendix E of the paper). Training requires different techniques — this gap was partially filled by QLoRA in 2023.

Attention scores not quantized. The QK^T attention computation stays in FP16. For long-context models where attention memory dominates, this is a significant limitation.

Overhead at small scale. For models of 6.7B parameters and below, LLM.int8() is slower than FP16. But models of that size generally fit in GPU memory without quantization, so this is rarely a practical concern.

The deeper insight

Perhaps the most lasting impact of this paper isn't the quantization algorithm itself — it's the discovery of emergent features. The finding that large transformers spontaneously develop a handful of extreme-magnitude dimensions, that these dimensions are critical for attention, and that their emergence follows a phase transition — this tells us something fundamental about how neural networks organize information at scale. Subsequent work on model interpretability has built on this observation, studying what these outlier dimensions encode and why they emerge.

From the paper: "While up to 150k outliers exist per 2048 token sequence for a 13B model, these outlier features are highly systematic and only represent at most 7 unique feature dimensions. Insights from this analysis were critical to developing mixed-precision decomposition."

Using LLM.int8() today

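python
# Load a model in 8-bit via bitsandbytes + Hugging Face Transformers
# (needs the bitsandbytes and accelerate packages and a CUDA GPU)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # Transformers-format checkpoint (gated on the Hub)
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # <-- that's it
    device_map="auto",
)
# Weights now take roughly 70 GB (1 byte per parameter) instead of ~140 GB in FP16
# The paper reports no performance degradation from this conversion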

What is the most lasting scientific contribution of the LLM.int8() paper, beyond the practical quantization algorithm?

The discovery that INT8 arithmetic is faster than FP16
The discovery and characterization of emergent outlier features: a phase transition at ~6.7B parameters where extreme-magnitude values appear in ALL transformer layers in just a handful of hidden dimensions, fundamentally revealing how large models organize critical information
The Hugging Face integration
