The paper that made 175B-parameter models fit on consumer GPUs — by discovering that outlier features break quantization and inventing mixed-precision decomposition to fix it.
You've trained a 175-billion-parameter language model. It took months of compute on thousands of GPUs. Now you want to use it — run inference, answer questions, generate text. One problem: the model's weights alone consume 350 GB of memory in FP16 (16-bit floating point). A single NVIDIA A100 GPU has 80 GB. You need at least 5 GPUs just to hold the model, before you've processed a single token.
This is the memory wall. For large language models at and beyond 6.7B parameters, the feed-forward and attention projection layers — the matrix multiplications — are responsible for 95% of all parameters and 65-85% of all computation. The bottleneck is not compute speed; it's how much model you can fit in memory.
What if you could represent each weight with 8 bits instead of 16? You'd cut memory in half. A 175B model would shrink from 350 GB to 175 GB — suddenly fitting on a single server with consumer GPUs. The arithmetic would be faster too: INT8 tensor cores can do roughly twice the throughput of FP16.
Let's see the scale of the problem. Here's what each hardware setup can run in 16-bit vs. 8-bit precision:
| Hardware | GPU Memory | Largest Model (16-bit) | Largest Model (8-bit) |
|---|---|---|---|
| 8x A100 80GB | 640 GB total | OPT-175B / BLOOM | OPT-175B / BLOOM |
| 8x A100 40GB | 320 GB total | OPT-66B | OPT-175B / BLOOM |
| 8x RTX 3090 | 192 GB total | OPT-66B | OPT-175B / BLOOM |
| 4x RTX 3090 | 96 GB total | OPT-30B | OPT-66B |
| Colab Pro | 15 GB | GPT-J-6B | OPT-13B |
| Free Colab | 12 GB | GPT-2 1.3B | T0/T5-11B |
Look at that last row. With 8-bit quantization, a free Google Colab notebook, something any student can access, can run an 11B-parameter model. In 16-bit, you're limited to 1.3B. That's roughly an 8x increase in accessible model size.
[Interactive figure: bars showing the GPU memory required to hold model weights for each model size.]
Let's do the arithmetic. A 175B-parameter model stores each parameter as a 16-bit float, which is 2 bytes. Total: $175 \times 10^9 \times 2$ bytes = 350 GB. But that's just the weights. During inference, you also need memory for:
| Component | What it stores | Approximate size (175B) |
|---|---|---|
| Model weights | $W_Q$, $W_K$, $W_V$, $W_O$, $W_{up}$, $W_{down}$ for each layer | 350 GB (FP16) / 175 GB (INT8) |
| KV cache | Key and value tensors for all past tokens | Grows with sequence length |
| Activations | Intermediate hidden states during forward pass | ~1-5 GB (depends on batch size) |
| CUDA overhead | Framework buffers, memory fragmentation | ~2-5 GB |
The weights dominate. If you can halve the weight memory, you can fit the model on half the GPUs — or fit a model twice as large on the same hardware.
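The weight arithmetic above is easy to reproduce. The following is a minimal sketch that counts weights only (no KV cache, activations, or framework overhead); the function names are illustrative, not from the paper.

```python
import math

def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def min_gpus(n_params: float, bits_per_param: int, gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(n_params, bits_per_param) / gpu_memory_gb)

params = 175e9
for bits in (16, 8):
    gb = weight_memory_gb(params, bits)
    gpus = min_gpus(params, bits, gpu_memory_gb=80)
    print(f"{bits}-bit: {gb:.0f} GB of weights, at least {gpus} x 80 GB GPUs")
```

Because the other components in the table are ignored, the GPU counts are lower bounds rather than deployment recommendations.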
Dettmers et al. solved this problem. Their method, LLM.int8(), quantizes transformer weights to 8-bit integers with zero performance degradation — even at 175B parameters. The key insight? They discovered that quantization fails at scale because of a handful of emergent outlier features — values 20x larger than everything else — that appear in specific hidden dimensions once models exceed about 6.7B parameters. By isolating these outliers and handling them in 16-bit while quantizing everything else to 8-bit, they achieved lossless compression.
This wasn't just an engineering improvement. It was a scientific discovery about how transformers organize information at scale — and that discovery is what made the engineering possible.
Let's understand how, starting with the basics of quantization.
Before we can understand LLM.int8(), we need to understand what quantization means and why it's hard. The core idea is simple: represent numbers with fewer bits.
A 16-bit floating point number (FP16) can represent values from approximately -65,504 to +65,504 with varying precision. It uses 1 sign bit, 5 exponent bits, and 10 mantissa bits. This gives you fine-grained precision near zero and coarser precision for large values — exactly what you want for neural network weights, which tend to cluster near zero.
An 8-bit signed integer (INT8) can represent exactly 256 values: the integers from -128 to +127. That's it. No fractions, no exponents. Just 256 evenly spaced integers.
To feel the difference: FP16 can distinguish between 1.0000 and 1.0010. INT8 can only distinguish between 1 and 2. If a weight has value 1.3, INT8 must round it to either 1 or 2 — losing the fractional part. This rounding error is quantization error, and the entire challenge of model quantization is keeping it small enough that the model doesn't notice.
Think of it like compressing a high-resolution photograph into a 256-color palette. You need a mapping function that assigns each original color to its closest palette entry, and a reverse mapping (dequantization) that converts back. The quality of the compressed image depends entirely on how cleverly you choose those 256 colors.
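For neural networks, there are two main approaches: absmax quantization, which scales values symmetrically by the largest absolute value, and zeropoint quantization, which shifts and scales an asymmetric [min, max] range.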
Let's build these from scratch.
A transformer's core operation is matrix multiplication: hidden states X (shape [sequence_length, hidden_dim]) multiplied by weight matrices W (shape [hidden_dim, output_dim]). In FP16, each element of X and W is a 16-bit float. The multiplication produces a 32-bit output that's then cast back to FP16.
If we quantize both X and W to INT8, we halve the memory for storing weights and can use INT8 tensor cores, which are roughly 2x faster than FP16 cores on modern GPUs. The INT8 multiplication produces INT32 outputs (because two 8-bit integers multiplied can be up to 127 × 127 = 16,129, needing more than 8 bits), which we dequantize back to FP16. The key question: does this round-trip — quantize, multiply, dequantize — preserve the original FP16 result?
$$X_{f16} W_{f16} = C_{f16} \approx S_{f16} \cdot C_{i32} = S_{f16} \cdot \left(X_{i8} W_{i8}\right)$$

Where $S_{f16}$ is a scaling factor that converts the INT32 result back to FP16. The "approximately equals" is the crux: how close is the approximation? That depends on the quantization scheme. A good scheme minimizes the error between $C_{f16}$ (the true FP16 matmul) and the dequantized INT8 result. A bad scheme introduces enough error that the model's outputs become garbage.
Let's derive absmax quantization from first principles. We have a floating-point tensor $X_{f16}$ with shape [s, h] (sequence length by hidden dimension). We want to map every value into the integer range [-127, +127].
The strategy: find the largest absolute value in the tensor, and scale everything so that value maps to 127 (or -127). All other values are scaled proportionally and rounded to the nearest integer.
Find the infinity norm of the tensor (the maximum absolute value across all elements) and divide 127 by it:

$$s_x = \frac{127}{\lVert X_{f16} \rVert_\infty}$$

This is the scaling factor. If the largest value in X is 5.0, then $s_x$ = 127 / 5.0 = 25.4. Every value gets multiplied by 25.4, so 5.0 maps to 127, 2.5 maps to ~64, and 0.1 maps to ~3.

$$X_{i8} = \left\lfloor s_x \cdot X_{f16} \right\rceil$$

Where ⌊·⌉ denotes rounding to the nearest integer. That's the entire quantization formula. Multiply by the scaling factor, round to integer, clamp to [-127, 127].

To recover the approximate original value, divide by the scaling factor:

$$\hat{X}_{f16} = \frac{X_{i8}}{s_x}$$
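Suppose we have a small tensor $X_{f16}$ = [-0.8, 1.5, 0.3, -2.1, 0.7]. The largest absolute value is 2.1, so $s_x$ = 127 / 2.1 ≈ 60.48. Multiplying and rounding gives $X_{i8}$ = [-48, 91, 18, -127, 42], and dividing by $s_x$ again recovers approximately [-0.794, 1.505, 0.298, -2.100, 0.694].

Not bad! The worst error is 0.006. But now imagine what happens if one value in the tensor is an outlier with magnitude 60. The scaling factor drops to 127 / 60 ≈ 2.12, so one integer step now spans about 0.47 in the original scale, and values such as 0.3 and 0.7 both round to the same integer, 1.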
This is foreshadowing. Remember this: a single large outlier destroys quantization precision for an entire tensor. We'll see that this is exactly what happens in large transformers.
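To make the effect concrete, here is a small sketch in plain Python that runs the absmax round trip on the five-value tensor from the example, once as is and once with a value of 60.0 appended. Appending the outlier (rather than replacing an entry) is an assumption made for illustration.

```python
def absmax_roundtrip(values):
    """Quantize to [-127, 127] with a single absmax scale, then dequantize."""
    scale = 127.0 / max(abs(v) for v in values)
    quantized = [round(v * scale) for v in values]
    recovered = [q / scale for q in quantized]
    return quantized, recovered

normal = [-0.8, 1.5, 0.3, -2.1, 0.7]
with_outlier = normal + [60.0]  # hypothetical outlier appended to the tensor

for name, xs in [("no outlier", normal), ("with outlier", with_outlier)]:
    ints, recovered = absmax_roundtrip(xs)
    # Only measure the error on the five original (normal) values.
    worst = max(abs(a - b) for a, b in zip(normal, recovered))
    print(f"{name}: ints={ints}, worst error on normal values={worst:.3f}")
```

With the outlier present, the five normal values are squeezed into the integers between -4 and 3, and 0.3 and 0.7 land on the same integer.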
[Interactive demo: a random tensor of 32 values with a controllable outlier, showing how one outlier destroys quantization precision for everything else.]
```python
import torch

def absmax_quantize(X_f16):
    """Absmax quantization: FP16 -> INT8"""
    scale = 127.0 / X_f16.abs().max()  # s_x = 127 / ||X||_inf
    X_i8 = (X_f16 * scale).round().clamp(-127, 127).to(torch.int8)
    return X_i8, scale

def absmax_dequantize(X_i8, scale):
    """Dequantize: INT8 -> FP16"""
    return (X_i8.float() / scale).half()

# Example: quantize a weight matrix
W = torch.randn(4096, 4096, dtype=torch.float16)
W_i8, scale = absmax_quantize(W)
W_recovered = absmax_dequantize(W_i8, scale)
print(f"Max error: {(W.float() - W_recovered.float()).abs().max():.6f}")
# The max error is roughly half a quantization step, i.e. about 0.5 / scale
```
In the previous chapter, we used a single scaling factor for the entire tensor. One outlier anywhere ruins precision everywhere. The natural fix: use more scaling factors.
The key insight from Dettmers et al. is to view matrix multiplication as a sequence of independent inner products. Consider multiplying $X \in \mathbb{R}^{s \times h}$ by $W \in \mathbb{R}^{h \times o}$. Each element of the output C[i, j] is the inner product of row i of X with column j of W. These inner products are independent: they don't share any computation.
| Method | Scaling constants | # Constants | Outlier isolation |
|---|---|---|---|
| Tensor-wise | One per tensor | 1 for X, 1 for W | None — one outlier kills everything |
| Row-wise | One per row of X, one for all of W | s for X, 1 for W | Partially — isolates outliers to their row |
| Vector-wise | One per row of X, one per column of W | s for X, o for W | Best — each inner product has its own scale |
For vector-wise quantization, we assign a scaling constant $c_{x_i}$ to each row i of X (computed as 127 / max|X[i, :]|) and a constant $c_{w_j}$ to each column j of W (computed as 127 / max|W[:, j]|). The quantized matrix multiplication becomes:

$$C_{f16} \approx \frac{1}{c_x \otimes c_w} \odot C_{i32} = S \odot \left(X_{i8} W_{i8}\right)$$

Where $c_x \in \mathbb{R}^{s}$ is the vector of row-wise scaling constants, $c_w \in \mathbb{R}^{o}$ is the vector of column-wise constants, ⊗ denotes the outer product, and ⊙ denotes element-wise multiplication. The dequantization matrix $S = (c_x \otimes c_w)^{-1}$ (an element-wise inverse) has shape [s, o]: one dequantization factor per output element.
Suppose X is 2×3 and W is 3×2:
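As a stand-in for a worked 2×3 by 3×2 case, the sketch below compares tensor-wise and vector-wise absmax quantization in plain Python. The matrix values are made up for illustration, and the helper names are not from the paper.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def quantize_rows(M):
    """Absmax-quantize each row with its own scale c = 127 / max|row|."""
    scales = [127.0 / max(abs(v) for v in row) for row in M]
    ints = [[round(v * c) for v in row] for row, c in zip(M, scales)]
    return ints, scales

def tensor_wise_matmul(X, W):
    """One scale for all of X and one for all of W."""
    sx = 127.0 / max(abs(v) for row in X for v in row)
    sw = 127.0 / max(abs(v) for row in W for v in row)
    X_i8 = [[round(v * sx) for v in row] for row in X]
    W_i8 = [[round(v * sw) for v in row] for row in W]
    return [[c / (sx * sw) for c in row] for row in matmul(X_i8, W_i8)]

def vector_wise_matmul(X, W):
    """One scale per row of X and one per column of W."""
    X_i8, cx = quantize_rows(X)
    Wt_i8, cw = quantize_rows(transpose(W))  # rows of W^T are columns of W
    C_i32 = matmul(X_i8, transpose(Wt_i8))   # exact integer accumulation
    return [[C_i32[i][j] / (cx[i] * cw[j]) for j in range(len(cw))]
            for i in range(len(cx))]

def row_errors(C_true, C_approx):
    """Worst absolute error in each output row."""
    return [max(abs(t - a) for t, a in zip(rt, ra))
            for rt, ra in zip(C_true, C_approx)]

# Illustrative values only: row 1 of X is far larger than row 0.
X = [[0.1, -0.2, 0.3], [30.0, -64.0, 50.0]]
W = [[0.5, -1.2], [0.25, 2.0], [-0.75, 0.4]]

C_true = matmul(X, W)
print("tensor-wise row errors:", row_errors(C_true, tensor_wise_matmul(X, W)))
print("vector-wise row errors:", row_errors(C_true, vector_wise_matmul(X, W)))
```

With a single scale for all of X, most of the small row rounds to zero. Giving each row its own scale keeps that row's output accurate, while the large row sees little change because its scale is the same in both schemes.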
```python
import torch

def vector_wise_quantize(X, W):
    """Vector-wise quantization for matrix multiplication."""
    # Compute scales in FP32 so they cannot overflow or underflow in FP16
    X, W = X.float(), W.float()

    # Row-wise scaling for X: one constant per row
    cx = 127.0 / X.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)  # shape [s, 1]
    X_i8 = (X * cx).round().clamp(-127, 127).to(torch.int8)

    # Column-wise scaling for W: one constant per column
    cw = 127.0 / W.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)  # shape [1, o]
    W_i8 = (W * cw).round().clamp(-127, 127).to(torch.int8)

    # INT8 matmul (accumulate in INT32); this integer matmul runs on CPU tensors
    C_i32 = X_i8.int() @ W_i8.int()  # shape [s, o]

    # Dequantize: outer product of inverse scaling constants
    S = 1.0 / (cx @ cw)  # shape [s, o], one factor per output
    C_f16 = (C_i32.float() * S).half()
    return C_f16
```
The paper shows that vector-wise quantization preserves perplexity up to about 2.7B parameters. Beyond that, even vector-wise quantization starts to degrade — because of the emergent outlier features we'll see in Chapter 4.
| Method | 125M | 1.3B | 2.7B | 6.7B | 13B |
|---|---|---|---|---|---|
| FP32 baseline | 25.65 | 15.91 | 14.43 | 13.30 | 12.45 |
| INT8 absmax tensor-wise | 87.76 | 16.55 | 15.11 | 14.59 | 19.08 |
| INT8 absmax vector-wise | 35.84 | 16.82 | 14.98 | 14.13 | 16.48 |
| INT8 zeropoint vector-wise | 25.72 | 15.94 | 14.36 | 13.38 | 13.47 |
| LLM.int8() | 25.83 | 15.93 | 14.44 | 13.24 | 12.45 |
Study this table carefully. Look at the 13B column. Tensor-wise absmax: 19.08 — worse than the 6.7B model. Vector-wise absmax: 16.48 — still terrible. Even zeropoint: 13.47, slightly degraded. Only LLM.int8() achieves 12.45, matching the FP32 baseline exactly. Something catastrophic happens at the 6.7B+ scale that only LLM.int8() can handle.
This is the paper's most important contribution: the discovery and characterization of emergent outlier features in large transformer hidden states. This section is fascinating because it reveals something genuinely surprising about how large language models work internally.
Dettmers et al. examined transformer hidden states $X \in \mathbb{R}^{s \times h}$ across models from 125M to 13B parameters. They tracked individual feature dimensions $h_i$ (columns of the hidden state matrix) across all layers, looking for values with magnitude ≥ 6.0.
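A simplified version of that search can be sketched in a few lines. This toy version applies only the magnitude threshold to a single hidden state matrix; the per-layer statistics reported in the table below are not modeled, and the example values are made up.

```python
def find_outlier_dims(hidden, threshold=6.0):
    """Map each outlier column index to the fraction of tokens it affects.

    A column counts as an outlier dimension if at least one token has a
    value with |value| >= threshold in that column.
    """
    n_tokens = len(hidden)
    stats = {}
    for j, column in enumerate(zip(*hidden)):
        hits = sum(1 for v in column if abs(v) >= threshold)
        if hits:
            stats[j] = hits / n_tokens
    return stats

# Toy hidden states: 4 tokens x 6 feature dimensions (made-up values).
hidden = [
    [0.4, -1.2, 0.9, -41.0, 2.1, -0.3],
    [1.1, 0.2, -0.7, -38.5, 0.5, 1.8],
    [-0.6, 2.9, 1.3, -2.0, -1.4, 0.1],
    [0.8, -0.5, 0.2, -44.2, 3.0, -2.2],
]
print(find_outlier_dims(hidden))  # column 3 exceeds the threshold in 3 of 4 tokens
```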
At small scales (the 117M-1.5B GPT-2 models in the table below), outliers are rare and sporadic: 1-2 dimensions with large values, appearing in 25-41% of layers and affecting 6-35% of sequence positions. Quantization handles these fine because they're not systematic enough to dominate the scaling factor.

At medium scales (2.7B-6B), outliers become more common: 5-6 dimensions, appearing in 52-62% of layers. Quantization starts to struggle.

At 6.7B and beyond, the pattern changes abruptly: 6-7 outlier dimensions appear in 100% of layers and affect about 75% of sequence positions.
| Model | Params | PPL | # Outlier dims | % Layers | % Seq dims | Outlier magnitude (quartiles) |
|---|---|---|---|---|---|---|
| GPT-2 | 117M | 33.5 | 1 | 25% | 6% | (-8, -7, -6) |
| GPT-2 | 345M | 26.0 | 2 | 29% | 18% | (6, 7, 8) |
| GPT-2 | 1.5B | 21.0 | 2 | 41% | 35% | (-11, -9, -7) |
| FSEQ | 2.7B | 14.4 | 5 | 52% | 18% | (-25, -16, -9) |
| GPT-J | 6.0B | 13.8 | 6 | 62% | 28% | (-21, -17, -14) |
| FSEQ | 6.7B | 13.3 | 6 | 100% | 75% | (-44, -40, -35) |
| FSEQ | 13B | 12.5 | 7 | 100% | 73% | (-63, -58, -45) |
Look at the magnitude column. At 6.7B, the outlier values reach -44. At 13B, they reach -63. Normal feature values in a transformer hidden state are typically in the range [-3.5, +3.5]. These outliers are 10-20x larger than everything else.
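Dettmers et al. tested what happens when you remove outlier features by setting them to zero before they enter the attention projection layers. The results are dramatic: top-1 attention softmax probability mass drops by more than 20%, and validation perplexity degrades by 600-1000%. Removing the same number of randomly chosen features instead lowers the probability by at most 0.3% and degrades perplexity by about 0.1%.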
These features make up only about 0.1% of all input dimensions, but they carry an outsized fraction of the model's "decision-making" information. They dominate the attention softmax, steering which tokens attend to which. The model has learned to encode critical information in just a handful of dimensions with extreme magnitudes.
The paper observes that emergence correlates more closely with perplexity than raw model size — a better-trained smaller model could potentially trigger the phase shift. The outliers appear to be a strategy the model develops to create sharp attention patterns. When a dimension has magnitude 60 and everything else is < 3, the softmax becomes extremely peaked on the tokens with the large outlier values. This is the mechanism the model uses to "pay attention" to specific positions.
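The sharpening effect of one large value on a softmax is easy to check numerically. The sketch below uses made-up logits as a stand-in for attention scores; how a hidden-state outlier propagates through the projections into those scores is not modeled here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

ordinary = [3.0, 2.0, 1.0, 0.5]       # all scores of similar size
with_outlier = [60.0, 2.0, 1.0, 0.5]  # one score dominated by an outlier

print("top weight, ordinary scores:", round(softmax(ordinary)[0], 3))
print("top weight, with outlier:   ", softmax(with_outlier)[0])
```

In the first case the top position gets a bit under two thirds of the weight. In the second it gets essentially all of it.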
The outliers are also highly asymmetric — mostly one-sided (either all positive or all negative across the sequence dimension). This is important: it explains why zeropoint (asymmetric) quantization outperforms absmax (symmetric) quantization for models at the 6.7B+ scale.
[Interactive figure: how outlier features progressively take over transformer layers as model scale increases, shown per model size.]
Now we understand the two pieces of the puzzle: quantization schemes (Chapters 2-3) and emergent outliers (Chapter 4). Let's put them together and see exactly why standard quantization fails at scale.
Remember that vector-wise quantization assigns one scaling constant per row of the hidden state X. A row of X is one token's hidden state — all h feature dimensions for a single sequence position. Outliers occur in specific feature columns — the same handful of hidden dimensions across all tokens.
Let's trace through a concrete example. Consider a hidden state with h=6 dimensions, where dimension 3 is an outlier dimension:
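Here is a minimal sketch of such a trace in plain Python. The values are made up, and "dimension 3" is read as the 0-based column index 3, which is an assumption.

```python
def rowwise_ints(row):
    """Absmax-quantize one token's hidden state with a single per-row scale."""
    scale = 127.0 / max(abs(v) for v in row)
    return [round(v * scale) for v in row]

OUTLIER_DIM = 3  # 0-based column index (assumption for this sketch)

# Three tokens, h = 6. Column 3 holds the outlier in every row.
X = [
    [1.2, -0.8, 0.5, -60.0, 0.7, 1.0],
    [-0.3, 2.0, 1.1, -55.0, -1.6, 0.2],
    [0.9, 0.4, -2.5, -62.0, 1.3, -0.6],
]

for row in X:
    normal_only = [v for j, v in enumerate(row) if j != OUTLIER_DIM]
    print("with outlier column:   ", rowwise_ints(row))
    print("outlier column removed:", rowwise_ints(normal_only))
```

With the outlier column present, every row's normal values fall on integers between -5 and 5, so nearby values such as 0.5 and 0.7 become indistinguishable. With the column removed, the same values spread across the full [-127, 127] range.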
This is why the perplexity table shows degradation at 6.7B+. Every row that contains an outlier (75% of rows) has its normal-valued features crushed. These normal values — 99.9% of the tensor — carry the bulk of the semantic information. Destroying their precision destroys the model's ability to distinguish between similar tokens.
Recall that outlier features are almost always one-sided (all negative or all positive). Absmax quantization maps [-max, +max] to [-127, +127], wasting half the range when values are one-sided. Zeropoint quantization uses the asymmetric range [min, max] to [-127, +127], utilizing all 254 levels.
This explains why zeropoint outperforms absmax in the perplexity table at 6.7B (13.38 vs 14.13 for vector-wise). But by 13B, even zeropoint fails (13.47 vs 12.45 baseline) — the outlier magnitudes have grown so large (-63) that no amount of clever scaling within a single row can preserve precision for the normal values alongside them.
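The difference between the two schemes on one-sided data can be sketched as follows. The zeropoint function below is one common asymmetric formulation (shift by the minimum, then scale), not necessarily the exact formula used in the paper, and the values are made up.

```python
def absmax_roundtrip(values):
    """Symmetric: map [-max|x|, +max|x|] onto [-127, 127]."""
    scale = 127.0 / max(abs(v) for v in values)
    return [round(v * scale) / scale for v in values]

def zeropoint_roundtrip(values):
    """Asymmetric: map [min, max] onto [-127, 127] by shifting, then scaling."""
    lo, hi = min(values), max(values)
    scale = 254.0 / (hi - lo)
    ints = [round((v - lo) * scale) - 127 for v in values]  # integers in [-127, 127]
    return [(q + 127) / scale + lo for q in ints]

def worst_error(values, recovered):
    return max(abs(v - r) for v, r in zip(values, recovered))

# One-sided values, like a single outlier dimension across tokens.
one_sided = [-63.0, -58.0, -45.0, -52.5, -60.2, -48.7]

print("absmax worst error:   ", round(worst_error(one_sided, absmax_roundtrip(one_sided)), 4))
print("zeropoint worst error:", round(worst_error(one_sided, zeropoint_roundtrip(one_sided)), 4))
```

Absmax spends its integers on the range [-63, +63] even though nothing is positive. Zeropoint spends them on [-63, -45], so each step is several times finer.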
The problem is mathematically inescapable as long as we try to represent both outliers and normal values with the same 8-bit integers in the same row. We have 254 quantization levels. The range of values in a contaminated row is [-63, +3.5] or so. That's a span of 66.5. Each quantization step covers 66.5/254 = 0.26. For normal values that differ by 0.1 or less, they all quantize to the same integer. Information is permanently lost.
You might wonder: why not quantize X by column instead of by row? Then each outlier column would get its own scaling constant. The problem is the inner-product structure of matrix multiplication. Each output C[i, j] sums over the hidden dimension h, so a scaling constant can only be factored out of that sum (and undone afterwards with a single multiplication) if it is constant along h. That holds for one constant per row of X and one per column of W, but not for one constant per column of X, which would vary inside the sum. Vector-wise quantization already uses the best available combination (rows of X, columns of W), but the outliers are in the columns of X, the axis that doesn't get its own scaling constant.
The solution? Don't try. Separate the outliers from the normal values and handle them differently. This is exactly what mixed-precision decomposition does.
The insight is elegant in its simplicity: if outlier features live in a few columns and destroy quantization for everything they touch, just remove them before quantizing. Handle the outlier columns in full 16-bit precision. Quantize everything else to INT8. Then combine the results.
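Given input hidden states $X_{f16} \in \mathbb{R}^{s \times h}$ and weights $W_{f16} \in \mathbb{R}^{h \times o}$:

1. Find the set O of outlier dimensions: the columns of X that contain at least one value with magnitude above the threshold (6.0).
2. Multiply the outlier columns of X by the matching rows of W in FP16.
3. Quantize the remaining columns of X and rows of W with vector-wise quantization, multiply them in INT8, and dequantize.
4. Add the two results.

In Einstein notation where indices are superscripts and h indexes the hidden dimension:

$$C_{f16} \approx \sum_{h \in O} X_{f16}^{h} W_{f16}^{h} + S_{f16} \cdot \sum_{h \notin O} X_{i8}^{h} W_{i8}^{h}$$

The first sum handles outlier dimensions in FP16. The second sum handles everything else in INT8 with vector-wise quantization ($S_{f16}$ is the dequantization scaling matrix from the outer product of row/column constants).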
The critical fact: for transformers up to 13B parameters, |O| ≤ 7. Out of a hidden dimension of ~5120 at the 13B scale, only 7 dimensions are outliers. That's 7/5120 = 0.14% of the dimensions handled in FP16. The remaining 99.86% are in INT8. The extra memory for storing the FP16 outlier columns is negligible — about 0.1% additional memory on top of the INT8 baseline.
```python
import torch

def llm_int8_matmul(X_f16, W_f16, threshold=6.0):
    """LLM.int8() mixed-precision matrix multiplication (reference sketch)."""
    # Step 1: Find outlier columns in X
    outlier_mask = X_f16.abs().max(dim=0).values > threshold  # [h]
    outlier_cols = outlier_mask.nonzero().squeeze(-1)     # indices
    regular_cols = (~outlier_mask).nonzero().squeeze(-1)  # indices

    # Step 2: Decompose
    X_outlier = X_f16[:, outlier_cols]  # [s, |O|]
    W_outlier = W_f16[outlier_cols, :]  # [|O|, o]
    # Regular parts are upcast so the scales below are computed in FP32
    # and cannot overflow FP16
    X_regular = X_f16[:, regular_cols].float()  # [s, h-|O|]
    W_regular = W_f16[regular_cols, :].float()  # [h-|O|, o]

    # Step 3: Outliers in FP16
    C_outlier = X_outlier @ W_outlier  # [s, o] in FP16

    # Step 4: Regular values, vector-wise INT8 quantization
    # Row-wise scaling for X_regular
    cx = 127.0 / X_regular.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    X_i8 = (X_regular * cx).round().clamp(-127, 127).to(torch.int8)
    # Column-wise scaling for W_regular
    cw = 127.0 / W_regular.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)
    W_i8 = (W_regular * cw).round().clamp(-127, 127).to(torch.int8)

    # INT8 matmul + dequantize. PyTorch's integer matmul is not implemented
    # for CUDA tensors, so this sketch emulates the INT32 accumulation in FP32;
    # real kernels use INT8 tensor cores.
    C_acc = X_i8.float() @ W_i8.float()
    S = 1.0 / (cx @ cw)  # outer product dequant
    C_regular = (C_acc * S).half()  # [s, o] back to FP16

    # Step 5: Combine
    return C_outlier + C_regular

# Usage sketch: same inputs and output shape as a plain X @ W
X = torch.randn(2048, 4096, dtype=torch.float16, device='cuda')
W = torch.randn(4096, 4096, dtype=torch.float16, device='cuda')
output = llm_int8_matmul(X, W)
```
Let's put it all together. LLM.int8() is the combination of two techniques: vector-wise absmax quantization and mixed-precision decomposition. Together, they form a complete procedure that converts a 16-bit transformer checkpoint to 8-bit at inference time with zero performance degradation.
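For every linear layer in the transformer (feed-forward up-projection, down-projection, attention Q/K/V/O projections), the following happens at inference time: outlier columns of the input are detected with the magnitude threshold and multiplied in FP16 against the matching weight rows, everything else goes through vector-wise INT8 quantization and an INT8 matmul, and the two results are summed in FP16.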
LLM.int8() targets the matrix multiplications in:
| Layer | Operation | Quantized? |
|---|---|---|
| FFN up-projection | $X \cdot W_{up}$ | Yes (INT8 + mixed-precision) |
| FFN down-projection | $X \cdot W_{down}$ | Yes (INT8 + mixed-precision) |
| Attention Q/K/V projections | $X \cdot W_Q$, $X \cdot W_K$, $X \cdot W_V$ | Yes (INT8 + mixed-precision) |
| Attention output projection | $X \cdot W_O$ | Yes (INT8 + mixed-precision) |
| Attention scores ($Q K^T$) | $\mathrm{softmax}(Q K^T / \sqrt{d})$ | No (no parameters, pure activation) |
| Embeddings | Lookup table | No (not a matmul) |
| LayerNorm | Normalization | No (tiny parameter count) |
This covers 95% of all model parameters and 65-85% of all computation. The remaining operations (embeddings, normalization, attention scores) stay in FP16.
Since 99.9% of values are stored in INT8 (1 byte) instead of FP16 (2 bytes), the memory savings are approximately 2x for the quantized layers. For BLOOM-176B, the total memory reduction is 1.96x — from ~352 GB to ~180 GB. The slight shortfall from a perfect 2x comes from the FP16 outlier columns, the scaling constants, and the non-quantized layers (embeddings, norms).
The quantization and decomposition overhead matters for runtime speed. For models at 6.7B and below, LLM.int8() is actually slower than FP16 due to the overhead of outlier detection and two separate matmuls. But for larger models (13B and up), INT8 tensor cores provide enough speedup to overcome the overhead:
| Model size | FP16 baseline | Vector-wise INT8 (no decomp) | LLM.int8() |
|---|---|---|---|
| 2.7B | 1.00x | 0.94x | 0.64x (slower) |
| 6.7B | 1.00x | 1.18x | 0.86x |
| 13B | 1.00x | 1.59x | 1.22x (faster) |
| 175B | 1.00x | 2.00x | 1.81x (faster) |
For BLOOM-176B end-to-end inference, LLM.int8() on 3x A100 80GB is comparable to FP16 on 8x A100 80GB — same speed, but using fewer than half the GPUs.
Now let's see the full LLM.int8() algorithm in action. This interactive simulation lets you experience the mixed-precision decomposition on a real(istic) hidden state matrix. You can control the outlier magnitude, the number of outlier dimensions, and see how the quantization error changes with and without mixed-precision decomposition.
[Interactive demo: a simulated hidden state matrix X with 8 tokens and 16 dimensions, outlier columns highlighted, comparing standard INT8 quantization with LLM.int8() decomposition. Sliders control the outlier magnitude and the number of outlier columns.]
Notice how in standard INT8 mode, increasing the outlier magnitude causes the error for normal values to spike dramatically. Switch to LLM.int8() mode and the normal-value error drops to near-zero because outliers are handled separately in FP16.
The simulation does exactly what the paper's algorithm does:
Standard INT8: All values are quantized with a single row-wise scaling factor. The outlier dominates the scale, crushing precision for normal values. Try setting the outlier magnitude to 70 and observe the red error bars — almost every token row has substantial error.
LLM.int8(): Outlier columns are extracted and multiplied in FP16 (zero error). The remaining values are quantized with their own row-wise scaling factor — now undistorted by outliers. The two results are summed. Total error is orders of magnitude lower. The dashed red lines show where the standard INT8 error would be — the improvement is dramatic.
1. Increase outlier magnitude from 5 to 70. In standard mode, watch the error grow linearly. In LLM.int8() mode, the error for normal values stays flat — because the outlier magnitude doesn't affect the INT8 quantization path at all.
2. Set outlier columns to 0. Both modes become identical — there's nothing to decompose. This is why LLM.int8() has no benefit for models below 6.7B (few or no outliers).
3. Set outlier columns to 4. More outlier columns means more values handled in FP16, but the paper shows that even at 13B scale, there are at most 7 outlier dimensions. The overhead of the FP16 path stays negligible.
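The experiments above can be approximated offline. The following is a rough sketch in plain Python, not the simulation's actual code: the matrix size matches the demo, while the random data, the outlier column positions (3 and 11), and the helper names are assumptions.

```python
import random

def rowwise_roundtrip(row):
    """Absmax-quantize one row to [-127, 127] and dequantize it again."""
    peak = max(abs(v) for v in row)
    if peak == 0.0:
        return list(row)
    scale = 127.0 / peak
    return [round(v * scale) / scale for v in row]

def mean_error_on_normal_values(X, outlier_cols, decompose):
    """Mean absolute round-trip error over the non-outlier entries of X."""
    total, count = 0.0, 0
    for row in X:
        normal_idx = [j for j in range(len(row)) if j not in outlier_cols]
        if decompose:
            # LLM.int8()-style: outlier columns bypass quantization, so only
            # the normal values share the row's scaling factor.
            recovered = rowwise_roundtrip([row[j] for j in normal_idx])
        else:
            full = rowwise_roundtrip(row)
            recovered = [full[j] for j in normal_idx]
        for j, r in zip(normal_idx, recovered):
            total += abs(row[j] - r)
            count += 1
    return total / count

def make_hidden_states(n_tokens=8, n_dims=16, outlier_cols=(3, 11),
                       magnitude=60.0, seed=0):
    """Gaussian hidden states with one-sided outliers in the given columns."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_tokens)]
    for row in X:
        for j in outlier_cols:
            row[j] = -magnitude * rng.uniform(0.8, 1.0)
    return X

for magnitude in (5.0, 20.0, 70.0):
    X = make_hidden_states(magnitude=magnitude)
    standard = mean_error_on_normal_values(X, {3, 11}, decompose=False)
    mixed = mean_error_on_normal_values(X, {3, 11}, decompose=True)
    print(f"magnitude {magnitude:>4}: standard INT8 {standard:.4f} | decomposed {mixed:.4f}")
```

Because the seed is fixed, the normal values are identical across magnitudes, so the decomposed error stays exactly flat while the standard error grows with the outlier magnitude. Passing an empty set of outlier columns makes both modes coincide.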
LLM.int8() was released on arXiv in August 2022, presented at NeurIPS 2022, and immediately changed the landscape of LLM deployment. Let's place it in the broader context of model compression and efficient inference.
| Method | Year | Key idea | Bits | Relationship to LLM.int8() |
|---|---|---|---|---|
| GPTQ | 2022 | Layer-wise post-training quantization using Hessian info | 4-bit | Complementary — pushes to fewer bits with calibration data |
| QLoRA | 2023 | 4-bit NormalFloat + LoRA for finetuning | 4-bit | By Dettmers — extends bitsandbytes to training |
| AWQ | 2023 | Activation-aware weight quantization | 4-bit | Builds on the outlier insight — protects important weights |
| SqueezeLLM | 2023 | Non-uniform quantization with sparse outliers | 3-4-bit | Directly extends the outlier decomposition idea |
| FP8 | 2023+ | Native 8-bit float hardware support (H100+) | 8-bit float | May eventually supersede INT8 quantization |
INT8 only. The paper does not study FP8 data types. Since 2022, NVIDIA H100 and later GPUs added native FP8 support, which may provide better precision-performance tradeoffs than INT8 for the same bit width.
Inference only. LLM.int8() is designed for inference. The authors' initial experiments with INT8 training showed degradation for attention projections at scale (Appendix E of the paper). Training requires different techniques — this gap was partially filled by QLoRA in 2023.
Attention scores not quantized. The QKT attention computation stays in FP16. For long-context models where attention memory dominates, this is a significant limitation.
Overhead at small scale. For models at 6.7B and below, LLM.int8() is slower than FP16. But models of that size generally fit in GPU memory without quantization, so this is rarely a practical concern.
Perhaps the most lasting impact of this paper isn't the quantization algorithm itself — it's the discovery of emergent features. The finding that large transformers spontaneously develop a handful of extreme-magnitude dimensions, that these dimensions are critical for attention, and that their emergence follows a phase transition — this tells us something fundamental about how neural networks organize information at scale. Subsequent work on model interpretability has built on this observation, studying what these outlier dimensions encode and why they emerge.
```python
# Load a model in 8-bit via bitsandbytes + Hugging Face transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # transformers-format checkpoint
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # <-- that's it
    device_map="auto",
)
# Weights now take ~70 GB instead of ~140 GB in FP16
# The paper reports zero performance degradation
```