EE269 Lecture 4 — Lloyd-Max Optimal Quantizers

Chapter 0: Why Non-Uniform?

You built a uniform quantizer in the last lecture. Equal-width bins, evenly spaced reconstruction levels. Clean, simple, analyzable. Now here's the problem: most real signals are not uniformly distributed.

Imagine you're quantizing speech. When you record someone talking, most of the audio samples cluster near zero — quiet moments, consonants, pauses. Loud vowels hit the extremes only briefly. If you histogram the amplitude values, you get a tall spike near zero with long thin tails. A Laplacian distribution, roughly.

Now picture your uniform quantizer with 8 levels spread evenly from -3 to +3. Three or four of those levels sit in the tails where the signal almost never goes. You've wasted half your resolution on empty space. Meanwhile, the region near zero — where most of the signal lives — gets only two or three levels crammed together. The quantization error is enormous precisely where it matters most.

The core insight: Uniform bins waste resolution where the pdf is small and under-resolve where the pdf is large. The fix: allocate more bins where f_X(x) is high.

This isn't just speech. Gaussian noise from sensors clusters near the mean. Neural network weights (as we'll see in Chapter 8) follow approximately normal distributions. Heavy-tailed activations in transformers have most mass near zero with rare but important outliers. In every case, the signal's statistics are non-uniform, so the quantizer should be too.

The Evidence: Amplitude Histograms

Pilanci shows histograms of amplitudes from real-world signals. The pattern is unmistakable:

Speech (telephony): A narrow spike near zero with exponential tails. Most of the time, you're between words or producing quiet consonants. The distribution is approximately Laplacian with scale parameter around 0.1–0.3.

Transform coefficients (DCT/wavelets): Even more peaked than speech. After a frequency transform, most coefficients are near zero (background regions, smooth gradients). Only a few coefficients carry significant energy (edges, texture). The distribution is super-Gaussian — heavier peak, longer tails than Gaussian.

Sensor noise: Close to Gaussian. Thermal noise in cameras, ADC quantization noise, accelerometer drift — all cluster around zero with well-behaved symmetric tails.

ML activations and weights: Weights are approximately Gaussian (we'll see this in detail in Chapter 8). Activations are more varied — ReLU outputs are half-Gaussian (zero for negative inputs), attention logits can have heavy tails, and layer-norm outputs are standardized Gaussian.

In none of these cases is the distribution uniform. A uniform quantizer is optimal only for the uniform distribution — which almost never occurs in practice.

Pilanci poses two questions that drive this entire lecture:

Question 1 (Finite levels): Given a fixed number of quantization levels M, how should we choose the decision regions and reconstruction values to minimize distortion?

Question 2 (Asymptotic): As M becomes very large, how should reconstruction values be distributed across the real line?

The answer to Question 1 is the Lloyd-Max algorithm (1960s-80s). The answer to Question 2 is the high-rate theory with its surprising 1/3 power law. Both are elegant, both are practical, and both connect directly to how we compress billion-parameter language models today.

A Quantitative Taste

How much does non-uniform quantization help? For a Gaussian source with 3 bits (M=8):

• Best uniform quantizer (optimized clipping at ±2.83): MSE ≈ 0.0345

• Lloyd-Max optimal: MSE ≈ 0.0335

• Improvement: ~3% for Gaussian (modest, because Gaussian isn't that peaked)

For a Laplacian source (speech-like) with 3 bits:

• Best uniform: MSE ≈ 0.0520

• Lloyd-Max optimal: MSE ≈ 0.0465

• Improvement: ~11% (much more significant for peaked distributions)

The more non-uniform the source, the more non-uniform quantization helps. This is why the payoff is especially large for LLM weight compression (Chapter 8).

Pilanci's theoretical result preview: In the high-rate regime (many levels), the asymptotic optimal density of quantization centroids is proportional to f_X(x)^1/3. This means: put more levels where the pdf is high, but NOT proportionally to the pdf. The cube root "softens" the allocation. We'll derive this in Chapter 6 — it's one of the deepest results in quantization theory.

Uniform vs. Non-Uniform: The Waste Problem

A Gaussian pdf with M=8 levels. Toggle between uniform and adapted spacing. Watch how error concentrates near zero for the uniform case.

Showing uniform quantizer. Notice how bins in the tails are nearly empty.

Why does a uniform quantizer perform poorly on speech signals?

Speech signals have too high a sample rate for uniform bins Most speech amplitudes cluster near zero, so uniform bins waste resolution in the rarely-visited tails Uniform quantizers cannot handle negative values

Chapter 1: General Quantizer Model

Before we can optimize anything, we need precise notation. A scalar quantizer is defined by two sets of numbers: where to cut the real line, and where to reconstruct within each cut.

Thresholds and Reconstruction Levels

A scalar quantizer with M levels is specified by:

Decision thresholds: t₀ < t₁ < ··· < t_M, where t₀ = −∞ and t_M = +∞.

Reconstruction levels (codepoints): y₀, y₁, ..., y_M−1.

The mapping rule is:

Q(x) = y_k if x ∈ [t_k, t_k+1)

That's it. Every input x falls into exactly one interval [t_k, t_k+1), and we output the corresponding y_k. The quantization error is ε = X − Q(X).

Think of it this way: The thresholds are fences that partition the real line into M "cells." Each cell has exactly one representative point y_k. Any input landing in that cell gets mapped to that point. The question is: where should we place the fences and the representatives?

Voronoi Cells on the Real Line

In one dimension, the cells are simply intervals. But the terminology matters: each interval [t_k, t_k+1) is the Voronoi cell of its reconstruction level y_k. In higher dimensions (vector quantization), these cells become complex polytopes. On the real line, they're just contiguous segments.

The key geometric insight: if we fix the reconstruction levels {y_k} and use squared error, the optimal threshold between y_k−1 and y_k is their midpoint. This is the nearest-neighbor rule — assign each x to whichever y_k is closer. On the real line, "closer" with squared error means the boundary is exactly halfway between adjacent levels.

Why the midpoint? If y_k−1 = −1 and y_k = 0.5, the midpoint is −0.25. For any x < −0.25, the squared distance to y_k−1 = −1 is smaller than to y_k = 0.5. For x > −0.25, it's the reverse. The midpoint is where both squared distances are equal: (x − y_k−1)² = (x − y_k)².

Quantizer Anatomy

Drag the reconstruction levels (orange dots) to see how thresholds (dashed lines) and cells change. The thresholds always sit at midpoints between adjacent levels.

Drag orange dots to rearrange levels. Thresholds update automatically.

The Quantization Error

The quantization error (or quantization noise) is the difference between the input and its quantized version:

ε = X − Q(X)

For a well-designed quantizer, this error has several properties:

• It's bounded: |ε| ≤ max cell width / 2 (for cells that don't extend to ±∞).

• For the overload region (tail cells extending to ∞), the error is unbounded but unlikely (low-probability tail events).

• The error is NOT uniformly distributed (unlike the approximation used for uniform quantizers in Lecture 3). For non-uniform quantizers, the error distribution varies by cell.

Encoder vs. Decoder

In practice, the quantizer has two parts:

• Encoder: Given input x, find which cell it belongs to: k = arg min_j |x − y_j| (or equivalently, find k such that t_k ≤ x < t_k+1). Transmit the index k using b = log₂(M) bits.

• Decoder: Given index k, output the reconstruction level y_k. This requires only a lookup table of M entries.

The encoder needs to know all thresholds (for comparison). The decoder only needs the M reconstruction levels. Both are stored as tables — no computation beyond comparison (encoder) or lookup (decoder).

Fixed-Rate vs. Variable-Rate

For fixed-rate quantization, we use M = 2^b levels, where b is the number of bits per sample. With b = 3 bits, we get M = 8 levels. Each quantized sample is encoded with exactly log₂(M) bits. (Later in the course, entropy coding allows variable-rate encoding where some symbols use fewer bits than others.)

Fixed-rate is simpler (constant bit rate, no buffer management), but variable-rate combined with entropy coding can achieve lower distortion at the same average bit rate. For this lecture, we focus on fixed-rate design.

How Many Parameters Do We Design?

For M = 2^b levels, the design has:

• M reconstruction levels: y₀, ..., y_M−1

• M − 1 interior thresholds: t₁, ..., t_M−1 (t₀ = −∞ and t_M = +∞ are fixed)

• Total: 2M − 1 free parameters

For b = 3 bits: 15 parameters. For b = 4 bits: 31 parameters. For a symmetric pdf, symmetry halves this. But even with exploiting symmetry, there's no closed-form solution for most distributions — we need the iterative algorithm from Chapter 4.

A 4-bit scalar quantizer has how many reconstruction levels?

4 8 16 32

Chapter 2: The Distortion Objective

We know what a quantizer looks like. Now we need to say what makes one quantizer better than another. The answer: mean squared error (MSE) distortion.

The MSE Formula

Given a random variable X with pdf f_X(x), the distortion of a quantizer Q is:

D ≜ E[(X − Q(X))²] = ∑_k=0^M−1 ∫_{t_k}^t_k+1 (x − y_k)² f_X(x) dx

Let's unpack this. The overall distortion is the sum of distortions from each cell:

D_k = ∫_{t_k}^t_k+1 (x − y_k)² f_X(x) dx

Each D_k measures: within cell k, how far are the signal values from the reconstruction level y_k, weighted by how likely those values are? If f_X(x) is large in a cell, errors there hurt more. If f_X(x) is tiny, even large (x − y_k)² contributes little to D.

The design problem: Given M = 2^b levels and a known pdf f_X, choose {t_k} and {y_k} to minimize D. This is a constrained optimization over 2M − 1 free parameters (M levels + M − 1 interior thresholds).

Worked Example: Computing D for a Specific Quantizer

Let's compute the distortion of a simple 2-level quantizer on N(0,1) with levels at y₀ = −1 and y₁ = +1, threshold at t₁ = 0.

D = ∫_−∞⁰ (x−(−1))² φ(x) dx + ∫₀^∞ (x−1)² φ(x) dx

By symmetry, both integrals are equal. Focus on the second:

∫₀^∞ (x−1)² φ(x) dx = ∫₀^∞ (x² − 2x + 1) φ(x) dx

We need three integrals over the half-line:

• ∫₀^∞ x²φ(x)dx = 1/2 (half the variance, since E[X²]=1)

• ∫₀^∞ xφ(x)dx = 1/√(2π) ≈ 0.3989

• ∫₀^∞ φ(x)dx = 1/2

Combining: 1/2 − 2/√(2π) + 1/2 = 1 − √(2/π) ≈ 0.2026. Total D = 2 × 0.2026 = 0.4053.

Compare this to the optimal 2-level quantizer (y₀ = −0.798, y₁ = 0.798) which achieves D = 1 − 2/π ≈ 0.3634. Our naive choice of ±1 is 11% worse than optimal. Even this small example shows how level placement matters.

Why It's Hard: Non-Convexity

Here's the bad news: D is not convex in the joint parameters {t_k, y_k}. There can be multiple local minima. Different initializations lead to different solutions. The globally optimal quantizer may be hard to find.

Here's the good news: D has simple necessary conditions for a local optimum. These conditions give us an iterative algorithm that always converges (to some local minimum), and for nice distributions (log-concave, like the Gaussian), the local minimum is unique and therefore global.

Think of it like finding the lowest valley in a mountain range. You can't see all the valleys at once (non-convex), but you can always walk downhill (iterative algorithm). For Gaussian-shaped landscapes, there's only one valley (uniqueness), so downhill always leads to the global minimum.

Counting the Local Minima

How bad is the non-convexity? For symmetric unimodal distributions, things are manageable. The number of "qualitatively different" local optima grows slowly with M. For a bimodal distribution (mixture of two Gaussians), a 4-level quantizer might have 3 local optima: (2+2 split), (1+3 split), (3+1 split). For log-concave unimodal distributions, there is only one.

The practical implication: for Gaussian and Laplacian sources (the most common cases in signal processing and ML), Lloyd-Max with any reasonable initialization finds the global optimum. For multimodal distributions, multiple restarts or smart initialization (from the high-rate compander) are needed.

The Non-Convexity in Detail

Why isn't D convex jointly in {t_k, y_k}? Consider: D is convex in {y_k} for fixed {t_k} (each D_k is quadratic in y_k). And D is convex in {t_k} for fixed {y_k} (the nearest-neighbor assignment is optimal). But jointly? Moving a threshold and a level simultaneously can create interactions that produce saddle points and local minima. The alternating minimization sidesteps this by decomposing into two individually convex subproblems.

Distortion Landscape

For a simple M=2 quantizer on a Gaussian, we can plot D as a function of the single reconstruction level y₁ (with y₀ fixed). The landscape is smooth but can have multiple local minima for larger M.

y₀ -1.5

Why can't we simply take the derivative of D and solve for the optimal quantizer in closed form?

The integral has no analytic solution The problem is non-convex with coupled variables, so setting all derivatives to zero gives necessary but not sufficient conditions, and the resulting equations are implicit The pdf f_X is unknown

Chapter 3: Lloyd-Max Conditions

Despite the non-convexity, we can derive two elegant necessary conditions for any locally optimal quantizer. These are the Lloyd-Max conditions, named after Stuart Lloyd (Bell Labs, 1957/1982) and Joel Max (1960).

Condition C1: The Centroid Condition

Fix the thresholds {t_k}. Since the cells are disjoint, we can minimize each D_k separately. D_k(y_k) is a convex quadratic in y_k, so we differentiate and set to zero:

∂D_k/∂y_k = ∫_{t_k}^t_k+1 2(y_k − x) f_X(x) dx = 0

Expanding:

y_k ∫_{t_k}^t_k+1 f_X(x) dx = ∫_{t_k}^t_k+1 x f_X(x) dx

Solving for y_k:

y_k = E[X | X ∈ [t_k, t_k+1)] = ∫_{t_k}^t_k+1 x f_X(x) dx ⁄ ∫_{t_k}^t_k+1 f_X(x) dx

C1 in words: Each reconstruction level must be the centroid (conditional mean) of its cell. Place each y_k at the "center of mass" of the probability in its interval. This makes perfect intuitive sense — the best single representative of a region is its average value, weighted by likelihood.

Condition C2: The Nearest-Neighbor Boundary

Now fix the codepoints {y_k} and optimize the thresholds. Only two cells depend on an interior boundary t_k:

D_k−1 + D_k = ∫_{t_k−1}^t_k (x − y_k−1)² f_X(x) dx + ∫_{t_k}^t_k+1 (x − y_k)² f_X(x) dx

Differentiate with respect to t_k using Leibniz's rule (the integrand evaluated at the boundary):

∂(D_k−1 + D_k)/∂t_k = (t_k − y_k−1)² f_X(t_k) − (t_k − y_k)² f_X(t_k) = 0

Assuming f_X(t_k) > 0, we divide it out:

(t_k − y_k−1)² = (t_k − y_k)²

Since y_k−1 < y_k for an ordered quantizer, the only valid root is:

t_k = (y_k−1 + y_k) / 2, k = 1, ..., M−1

C2 in words: Each threshold sits at the midpoint between its two neighboring reconstruction levels. At the boundary, both neighbors are equally good — the signal is equidistant from y_k−1 and y_k. This is the nearest-neighbor rule: assign each x to the closest codepoint.

Worked Example: M=2 for N(0,1)

Let's verify the conditions by hand for the simplest non-trivial case: M=2 levels on a standard Gaussian. By symmetry, we expect y₀ < 0 < y₁ and t₁ = 0.

Step 1 (C2): The single interior threshold is t₁ = (y₀ + y₁)/2. By symmetry of N(0,1), we expect y₁ = −y₀, so t₁ = 0. Check.

Step 2 (C1): For cell 1 with t₁=0 and t₂=+∞:

y₁ = E[X | X ≥ 0] = ∫₀^∞ x φ(x) dx / ∫₀^∞ φ(x) dx

The denominator is 1/2 (half the Gaussian). The numerator is ∫₀^∞ x φ(x) dx = 1/√(2π) (evaluate by substitution u = x²/2). Therefore:

y₁ = (1/√(2π)) / (1/2) = √(2/π) ≈ 0.7979

And by symmetry, y₀ = −√(2/π) ≈ −0.7979. The MSE-optimal 1-bit quantizer for N(0,1) reconstructs at ±0.798.

The distortion: D = E[(X − Q(X))²] = E[X²] − 2E[X · Q(X)] + E[Q(X)²]. Since E[X²] = 1 (unit variance) and Q(X) = sign(X) · √(2/π):

• E[Q(X)²] = 2/π (always takes value ±√(2/π))

• E[X · Q(X)] = √(2/π) · E[|X|] = √(2/π) · √(2/π) = 2/π

• D = 1 − 2(2/π) + 2/π = 1 − 2/π ≈ 0.3634

Compare to uniform with the same range: if we put levels at ±0.5, distortion is about 0.5. Lloyd-Max saves ~27% even with just 2 levels. For more levels, the absolute improvement is smaller (as a percentage) but the principle remains: centroids always beat arbitrary placements.

This worked example demonstrates how both Lloyd-Max conditions work together. The threshold t₁=0 is the midpoint between y₀ and y₁ (C2 satisfied). The levels ±√(2/π) are the conditional means of [−∞,0) and [0,+∞) respectively (C1 satisfied). Both conditions hold simultaneously — it's a fixed point of the alternating algorithm, achieved in one step due to the symmetry of this simple case.

Important Caveats

These conditions are necessary but not sufficient for optimality. There may be multiple quantizers satisfying both C1 and C2 simultaneously — some are local minima, not global. However, for log-concave densities (where log f_X(x) is concave — this includes Gaussian, Laplacian, uniform, logistic), the optimal quantizer is unique.

C1 and C2 Visualized

Green lines show thresholds (midpoints between levels). Orange dots show reconstruction levels (centroids of their cells). Toggle to see what happens when conditions are violated.

Showing a quantizer satisfying both Lloyd-Max conditions.

What does the centroid condition (C1) require?

Each reconstruction level y_k equals the conditional mean E[X | X in cell k] Each threshold is the midpoint between adjacent levels The reconstruction levels are uniformly spaced

Chapter 4: The Lloyd-Max Algorithm

The two conditions C1 and C2 suggest an obvious strategy: alternate between them. Fix thresholds, update levels (C1). Fix levels, update thresholds (C2). Repeat. This is the Lloyd-Max algorithm — an alternating minimization that monotonically decreases distortion at every step.

The Algorithm

Initialize

Choose starting levels y₀ < ... < y_M−1 (e.g., uniform spacing or random)

↓

Boundary Update (C2)

t_k ← (y_k−1 + y_k)/2 for k=1,...,M−1

↓

Centroid Update (C1)

y_k ← E[X | X ∈ [t_k, t_k+1)] for k=0,...,M−1

↻ repeat until convergence

Convergence guarantee: Each step cannot increase D (it's coordinate descent on a bounded-below function). The algorithm converges to a stationary point (local minimum). For log-concave pdfs, this is the global minimum.

Why monotone decrease? After the boundary update (fixing boundaries at midpoints), D can only decrease or stay the same (we've optimally reassigned points). After the centroid update (fixing levels at conditional means), D can only decrease or stay the same (centroids minimize squared error within each cell). Two non-increasing steps composed = monotone non-increasing sequence. Since D ≥ 0, it must converge.

Connection to 1D k-means

When you have empirical data samples {x₁, ..., x_n} instead of a known pdf, replace integrals with sums. The centroid update becomes a sample mean over assigned points. The boundary update assigns each point to the nearest level. This is exactly the k-means algorithm with M clusters in one dimension.

D̂ = (1/n) ∑_i=1ⁿ (x_i − y_a(i))²

Where a(i) ∈ {0, ..., M−1} is the assignment of sample i to its nearest level. k-means alternates: (1) assign each x_i to the nearest y_k, (2) recompute y_k as the mean of assigned points. In 1D, assignment step = boundary update. Mean step = centroid update. Lloyd-Max IS 1D k-means.

The deep connection: The Lloyd-Max algorithm (1957) and k-means clustering (Lloyd, 1982) are the same algorithm. Quantizer design = clustering. This isn't a coincidence — both solve "find M representatives that minimize squared-error distortion." The difference is just whether you work with a known pdf (integrals) or empirical samples (sums).

This connection runs deep. In machine learning, k-means is usually presented as a clustering algorithm — group similar data points together. In signal processing, Lloyd-Max is a quantizer design algorithm — find M representative values to minimize reconstruction error. They are literally the same mathematical problem:

min_{{y_k}} E[min_k (X − y_k)²]

The "assignment step" in k-means (assign each point to its nearest center) is the boundary update (C2). The "update step" (recompute centers as means) is the centroid update (C1). Every result about k-means (convergence, local minima, initialization strategies like k-means++) applies to quantizer design, and vice versa.

Implementation Notes

For analytic pdfs (Gaussian, Laplacian), the integrals in the centroid update may need numerical quadrature (Gauss-Legendre, Simpson's rule, or scipy.integrate.quad). For empirical data, it's just arithmetic: y_k = mean of all samples assigned to cell k.

Initialization strategies:

• Uniform spacing: Place M levels uniformly in [−3σ, 3σ]. Simple but may be far from optimal for peaked distributions.

• High-rate compander init: Use t_k = g⁻¹(k/M) from the analytic compressor. This starts near the asymptotic optimum.

• Equal-probability init: Place thresholds so each cell has probability 1/M. Good for highly non-uniform pdfs.

• Multiple random restarts: For non-log-concave distributions, run from many initializations and keep the best.

Exploiting symmetry: For symmetric pdfs, we know the optimal quantizer is symmetric (for even M: t_M/2 = 0, y_k = −y_M−1−k). This halves the number of unknowns and guarantees the correct structure.

Stopping criterion: Stop when |D⁽ⁿ⁾ − D⁽ⁿ⁻¹⁾| < ε (typically ε = 10⁻⁸), or when max_k|y_k⁽ⁿ⁾ − y_k⁽ⁿ⁻¹⁾| < ε. In practice, 10-20 iterations usually suffice for log-concave distributions.

Lloyd-Max Convergence Simulator

Watch the algorithm converge on a Gaussian source. Click "Iterate" to step through. Try different initializations to see local minima. The MSE plot shows monotone decrease.

Levels M

Distribution

Iteration 0 | MSE: — | Click "Iterate" to begin.

Why is the Lloyd-Max algorithm equivalent to 1D k-means?

Both alternate between assigning points to nearest centers and recomputing centers as means — in 1D, assignment boundaries are midpoints (C2) and means are centroids (C1) Both use gradient descent on the same loss function Both require the data to be Gaussian

Chapter 5: Solutions for Gaussian & Other Distributions

Running Lloyd-Max numerically for specific distributions gives us tables of optimal thresholds and levels. These are the "gold standard" quantizers for their respective distributions. Let's look at the most important cases.

Standard Normal: N(0,1) with M=8 (3 bits)

For the Gaussian, the optimal quantizer is symmetric around zero (because the pdf is symmetric). Lloyd-Max converges to a unique solution (Gaussian is log-concave):

k	Threshold t_k	Level y_k
0	t₀ = −∞	y₀ ≈ −2.1519
1	t₁ ≈ −1.7479	y₁ ≈ −1.3439
2	t₂ ≈ −1.0500	y₂ ≈ −0.7560
3	t₃ ≈ −0.5005	y₃ ≈ −0.2451
4	t₄ = 0	y₄ ≈ 0.2451
5	t₅ ≈ 0.5005	y₅ ≈ 0.7560
6	t₆ ≈ 1.0500	y₆ ≈ 1.3439
7	t₇ ≈ 1.7479	y₇ ≈ 2.1519
8	t₈ = +∞	—

Notice the pattern: inner levels are tightly packed near zero (spacing ~0.49), while outer levels are farther apart (spacing ~0.81). The quantizer allocates more resolution where the Gaussian has more probability mass.

Unit-Variance Laplacian with M=8

The Laplacian f(x) = (√2/2) e^−√2|x| is even more peaked at zero than the Gaussian:

k	Threshold t_k	Level y_k
0	t₀ = −∞	y₀ ≈ −3.0867
1	t₁ ≈ −2.3796	y₁ ≈ −1.6725
2	t₂ ≈ −1.2527	y₂ ≈ −0.8330
3	t₃ ≈ −0.5332	y₃ ≈ −0.2334
4	t₄ = 0	y₄ ≈ 0.2334
5	t₅ ≈ 0.5332	y₅ ≈ 0.8330
6	t₆ ≈ 1.2527	y₆ ≈ 1.6725
7	t₇ ≈ 2.3796	y₇ ≈ 3.0867
8	t₈ = +∞	—

Comparing Gaussian vs. Laplacian: The Laplacian has tighter inner thresholds (more resolution near 0) and more spread-out outer levels (heavier tails need wider outer cells). The Laplacian's peak is sharper, so it needs even more non-uniformity than the Gaussian.

Key Observations

Symmetry: Both designs are symmetric because their pdfs are symmetric. This cuts the design problem in half — we only need to find the positive thresholds and levels.

Uniform source: For X ~ Uniform, the optimal quantizer is uniform. The theory is self-consistent: when the data is uniform, non-uniformity buys nothing.

Log-concave uniqueness: Gaussian, Laplacian, uniform, and logistic distributions all have log-concave densities. For these, Lloyd-Max converges to the unique global optimum from any initialization. For multimodal distributions, multiple local optima may exist.

Spacing Analysis: How Non-Uniform Are These Quantizers?

Let's measure how far from uniform the optimal quantizers are. For the Gaussian with M=8:

Cell	Width (t_k+1−t_k)	Probability mass	Ratio to uniform
Inner (k=3,4)	0.501	0.191	0.59×
Mid (k=2,5)	0.550	0.153	0.65×
Outer (k=1,6)	0.698	0.103	0.82×
Tail (k=0,7)	∞ (clipped ~3.5)	0.053	N/A

The inner cells are ~40% narrower than uniform, capturing ~19% of the probability each. The tail cells capture only ~5% of probability but span an infinite range. A uniform quantizer would give each cell 12.5% probability — wildly inefficient allocation for Gaussian data.

For the Laplacian, the non-uniformity is even more extreme: the innermost cells are ~47% narrower than uniform, while the tail cells span a much wider range due to the heavier exponential tails.

Optimal Levels Comparison

Compare Lloyd-Max optimal levels for Gaussian vs. Laplacian. Notice how Laplacian packs levels tighter near zero.

For which class of distributions does Lloyd-Max guarantee a unique global optimum?

All continuous distributions Log-concave distributions (Gaussian, Laplacian, uniform, logistic) Only symmetric distributions

Chapter 6: High-Rate Theory

Lloyd-Max gives us the optimal quantizer for any fixed M, but requires numerical iteration. What if M is very large? Can we get an analytic formula for how the levels should be distributed? The answer is yes, and it reveals a beautiful and surprising result.

The High-Rate Approximation

When M is large, each cell is narrow. Within a narrow cell, we can assume the pdf is approximately constant — the quantizer is "locally uniform." Let Δ(x) be the local cell width at position x. The distortion from a cell of width Δ centered at x, where the pdf is approximately f_X(x), is:

D ≈ (1/12) ∫ Δ(x)² f_X(x) dx

This is Bennett's approximation. Compare to the uniform case: Δ(x) ≡ Δ gives D ≈ Δ²/12 (the quantization noise formula from Lecture 3). Now Δ(x) varies with x, and we get to choose how.

Point Density Formulation

Instead of cell width, think in terms of point density λ(x):

λ(x) = (fraction of centroids per unit length near x), ∫ λ(x) dx = 1

For M total centroids, the local cell width is Δ(x) ≈ 1/(Mλ(x)). Substituting into the distortion formula:

D ≈ (1/12M²) ∫ f_X(x) / λ(x)² dx

The Optimization: Finding Optimal λ

We minimize D over all valid densities λ(x) ≥ 0 with ∫λ(x)dx = 1. Form the Lagrangian:

L(λ) = ∫ [f_X(x)/λ(x)² + νλ(x)] dx

Take the functional derivative and set to zero (pointwise optimization):

∂/∂λ [f_X/λ² + νλ] = −2f_X/λ³ + ν = 0

Solving:

λ(x)³ ∝ f_X(x) ⇒ λ^☆(x) ∝ f_X(x)^1/3

The 1/3 power law: The optimal density of quantization centroids is proportional to f_X(x)^1/3 — NOT f_X(x) itself! This is one of the most elegant results in quantization theory. You don't simply place levels proportional to the pdf. The cube root softens the allocation — high-density regions get more levels, but not as many as you'd naively expect.

Equivalently, the optimal cell width is:

Δ^☆(x) ∝ f_X(x)^−1/3

Concrete Examples

Gaussian N(0,1): f_X(x)^1/3 ∝ e^−x²/6, which is itself a Gaussian with variance 3. So λ^☆(x) = N(0, 3). The optimal high-rate compressor is:

g(x) = Φ(x/√3), thresholds: t_k ≈ √3 · Φ⁻¹(k/M)

Compare to equal-probability quantization where t_k = Φ⁻¹(k/M). The √3 factor makes the high-rate quantizer "less aggressive" — it doesn't pack as much into the tails as equal-probability would suggest.

Laplacian (unit variance): f_X(x)^1/3 ∝ e^−|x|√2/3, so λ^☆ is a Laplace distribution with scale 3/√2. The compressor is piecewise exponential:

g⁻¹(u) = (3/√2) ln(2u) for u < 1/2; −(3/√2) ln(2(1−u)) for u ≥ 1/2

Uniform source: f_X is constant on its support, so f_X^1/3 is also constant. The optimal point density is uniform — confirming that uniform quantization is optimal for uniform sources. The theory is self-consistent.

Why 1/3 and not 1?

Intuition: if you allocate levels proportional to f_X(x), you over-invest in high-density regions. Yes, errors there are weighted more heavily. But making cells very narrow has diminishing returns (distortion goes as Δ²). The cube root balances two competing forces: (1) wanting small cells where the pdf is large, and (2) the quadratic cost of cell width. The exponent 1/3 = 1/(2+1) comes from the trade-off between the squared cell width and the linear density constraint.

More precisely: the distortion contribution from a cell at x scales as Δ(x)² · f_X(x). The constraint is ∑ Δ(x)⁻¹ = M (total number of cells). By the method of Lagrange multipliers, the optimal Δ satisfies 2Δ · f − λ / Δ² = 0, giving Δ ∝ f^−1/3. The "2" from the derivative of Δ² and the "−2" from the constraint's Δ⁻¹ derivative combine to produce the 1/3 exponent.

Here's the full calculation for those who want to see it. We minimize the Lagrangian with respect to the cell width Δ at each point x:

L = ∫ (1/12)Δ(x)²f(x) dx + ν[∫ 1/Δ(x) dx − M]

Taking the functional derivative ∂L/∂Δ(x) = 0:

(1/6)Δ(x)f(x) − ν/Δ(x)² = 0

Solving for Δ(x):

Δ(x)³ = 6ν/f(x) ⇒ Δ(x) = (6ν)^1/3 f(x)^−1/3

The constant (6ν)^1/3 is determined by the constraint ∫ 1/Δ(x) dx = M. Substituting back gives the point density λ(x) = 1/(MΔ(x)) ∝ f(x)^1/3. The derivation is complete.

Note how elegant the result is: a constrained optimization over an infinite-dimensional function space (all possible cell-width functions Δ(x)) reduces to a pointwise algebraic equation. This is the power of the calculus of variations applied to quantizer design.

The formula Δ^☆(x) ∝ f(x)^−1/3 has a concrete interpretation: where the pdf is 8 times larger, the cell width decreases by factor 8^−1/3 = 1/2. Doubling the pdf at a point decreases the cell width by factor 2^−1/3 ≈ 0.79. The cube root moderates the response to density changes — you don't halve the cell width just because the pdf doubled.

Sanity Check: Equal-Probability vs. f^1/3

A common misconception: "make all cells equally probable" (each cell has probability 1/M). This corresponds to λ(x) ∝ f_X(x), i.e., the 1/1 power, not 1/3. Equal-probability quantization is NOT optimal for MSE! It over-resolves near the peak and under-resolves in the tails. The 1/3 power law is the correct trade-off.

For Gaussian with M=8: equal-probability thresholds are at Φ⁻¹(k/8) = {−1.15, −0.67, −0.32, 0, 0.32, 0.67, 1.15}. The Lloyd-Max thresholds {−1.75, −1.05, −0.50, 0, 0.50, 1.05, 1.75} are more spread out. The f^1/3 thresholds at √3 · Φ⁻¹(k/8) = {−1.99, −1.17, −0.55, 0, 0.55, 1.17, 1.99} are even wider. As M grows, the high-rate thresholds converge to the Lloyd-Max solution.

Three threshold strategies compared: Equal-probability concentrates at the center (too aggressive). Uniform spacing ignores the pdf entirely (too conservative). The f^1/3 rule is the Goldilocks solution — it allocates more resolution where probability is high, but moderates the allocation via the cube root so the tails aren't neglected.

Zador's Formula (Panter-Dite)

Substituting λ^☆ back into the distortion gives the asymptotic MSE:

D^☆(b) ≈ (2^−2b/12) [∫ f_X(x)^1/3 dx]³

This is Zador's formula (also called the Panter-Dite formula). Let's unpack what it tells us:

• The 2^−2b factor: Distortion decreases by a factor of 4 for each additional bit (6 dB per bit). This is the same scaling as uniform quantization — adding a bit always halves the cell widths, quartering the squared error.

• The 1/12 factor: Same as the uniform quantizer's Δ²/12 formula. This is the variance of a uniform distribution on [−Δ/2, Δ/2].

• The shape factor [∫f^1/3dx]³: This depends only on the distribution's "shape" (invariant to scaling). It measures how much the distribution deviates from uniform. For a uniform distribution on [0,1], the integral is 1, and the formula gives Δ²/12. For Gaussian, the integral is (2π)^1/3 · 3^1/2 ≈ 2.72, giving D^☆ ≈ 1.68 · 2^−2b.

Comparison with uniform quantizer: For Gaussian with optimized clipping, the best uniform quantizer achieves D_uniform ≈ 1.96 · 2^−2b. The optimal non-uniform quantizer achieves D^☆ ≈ 1.68 · 2^−2b. The ratio is about 0.86 — non-uniform quantization gives a 14% improvement asymptotically, which translates to ~0.22 bits of advantage (the non-uniform quantizer at b bits matches the uniform quantizer at b+0.22 bits).

Shape Factors for Common Distributions

The quantity γ = [∫f_X(x)^1/3dx]³ is the shape factor (or "non-uniformity penalty"). It determines how much harder a distribution is to quantize compared to uniform:

Distribution	γ	D^☆(b) / D_{uniform-on-same-support}
Uniform	1	1 (optimal = uniform)
Gaussian N(0,1)	πe/3 ≈ 2.84	0.86
Laplacian (unit var.)	9e²/4 ≈ 16.6	0.70
Exponential	27e²/4 ≈ 49.7	0.56

The Laplacian has a much larger shape factor than the Gaussian, meaning non-uniform quantization gives a bigger payoff. The exponential (one-sided heavy tail) benefits even more. The uniform distribution has γ = 1, confirming that no improvement over uniform quantization is possible.

The 6 dB Per Bit Rule

From Zador's formula, each additional bit reduces D by a factor of 4 (since 2^−2(b+1)/2^−2b = 1/4). In dB: 10 log₁₀(4) ≈ 6.02 dB per bit. This rule applies to BOTH uniform and optimal non-uniform quantizers — the shape factor γ is a constant multiplier that doesn't change with b. The slope of the distortion-rate curve is universal; only the intercept depends on the distribution.

The 1/3 Power Law

Compare f_X(x) (the pdf) vs. f_X(x)^1/3 (the optimal point density). Notice how the 1/3 power "flattens" the distribution — less extreme concentration near the peak.

Distribution

In the high-rate regime, the optimal centroid density is proportional to:

f_X(x) — place levels proportional to the pdf f_X(x)^(1/3) — the cube root of the pdf f_X(x)^(1/2) — the square root of the pdf Uniform — the same everywhere

Chapter 7: Companding: μ-law & A-law

The high-rate theory gives us the optimal point density λ^☆(x) ∝ f_X(x)^1/3. But how do we build this in practice? The answer is companding: compress, quantize uniformly, then expand.

The Compressor Function

Define a monotone compressor function g(x) whose derivative equals the optimal point density:

g'(x) = λ^☆(x) = f_X(x)^1/3 / ∫ f_X(u)^1/3 du

Then:

g(x) = ∫_−∞^x g'(u) du ∈ [0, 1]

The key insight: if we quantize u = g(x) uniformly into M bins (each of width 1/M), the thresholds in the original x-domain are:

t_k ≈ g⁻¹(k/M), k = 0, ..., M

Since g'(x) = λ^☆(x), regions where the pdf is high get compressed into a small range of u, so the uniform quantizer in u-space places more levels there in x-space. The compressor "warps" the signal axis so that optimal non-uniform bins become uniform bins after transformation.

The companding recipe: (1) Compress: u = g(x). (2) Quantize u uniformly with M levels. (3) Expand: x̂ = g⁻¹(quantized u). The non-uniformity lives entirely in g. The quantizer itself is the simple uniform one we already know.

μ-law Companding

In practice, telephony systems don't compute f_X(x)^1/3 — they use a parametric family of compressors. The most famous is μ-law:

g(x) = sgn(x) · ln(1 + μ|x|/X_max) / ln(1 + μ)

Where μ controls the degree of compression. Large μ means more non-uniformity (stronger compression near zero). The standard for North America and Japan uses μ = 255.

The process:

Compress

u = g(x): map signal through μ-law compressor

↓

Quantize

q = Q_uniform(u): quantize u with equal-width bins

↓

Expand

x̂ = g⁻¹(q): map back through inverse compressor

A-law Companding

A-law is the European standard (A = 87.6). It uses a piecewise linear/logarithmic compressor:

g_A(x) = A|x|/(1+ln A) for |x| ≤ 1/A; (1+ln(A|x|))/(1+ln A) for |x| > 1/A

Compared to μ-law, A-law provides better resolution for low-level signals (quiet speech). μ-law has better dynamic range for loud signals but worse distortion for quiet passages. The trade-off:

Property	μ-law (μ=255)	A-law (A=87.6)
Region	North America, Japan	Europe, international
Small-signal quality	Good	Better
Dynamic range	Better	Good
Idle channel noise	Higher	Lower
Compression law	Smooth logarithmic	Piecewise linear/log

Both standards use 8 bits (256 levels) after compression. The compressed signal is called PCM (Pulse Code Modulation) and has been the backbone of digital telephony since the 1960s. Every phone call you've ever made uses companding.

Why Companding Works: An Intuitive Explanation

Think of the compressor as a "pdf equalizer." The function g(x) maps the input distribution f_X(x) into a distribution that is closer to uniform in the compressed domain. Once the signal is approximately uniform, a uniform quantizer is approximately optimal. The expansion step g⁻¹ then maps the uniformly-quantized values back to the original domain, producing the non-uniform spacing we want.

The beauty: you get non-uniform quantization using only three simple operations (compress, quantize uniformly, expand). No need for complex decision logic. The non-uniformity is "baked into" the compressor function. This is why companding was practical even with 1960s analog electronics — diodes and op-amps can implement logarithmic curves cheaply.

Verifying the Compressor Design

Let's verify that μ-law with μ=255 approximates the high-rate optimal compressor for speech. Speech amplitudes are roughly Laplacian. The optimal g'(x) ∝ f(x)^1/3 ∝ e^−|x|/(3b) for a Laplacian with scale b. The μ-law compressor g'(x) = μ/(x(1+μ)ln(1+μ)) is approximately exponential for small x and flattens for large x. With μ=255, the match to the Laplacian f^1/3 compressor is quite good over the speech amplitude range, which is why it was standardized.

Closed-Form Compressors from High-Rate Theory

For specific distributions, we can compute g(x) analytically:

Gaussian N(0,1): Since f_X(x)^1/3 ∝ e^−x²/6, the optimal λ^☆ is itself Gaussian with variance 3. The compressor is:

g(x) = Φ(x/√3), g⁻¹(u) = √3 · Φ⁻¹(u)

High-rate thresholds: t_k ≈ √3 · Φ⁻¹(k/M). Compare to equal-probability quantization (t_k = Φ⁻¹(k/M)): the √3 factor makes the compander "less aggressive" near the tails.

Laplacian (unit variance): With f_X(x)^1/3 ∝ e^{−|x|/(√2 · 3)}, the compressor is piecewise exponential:

g⁻¹(u) = (3/√2) ln(2u) for u < 1/2; −(3/√2) ln(2(1−u)) for u ≥ 1/2

Other Practical Non-Uniform Quantizers

Dead-zone quantizers: In transform coding (JPEG, audio codecs), transformed coefficients (DCT, wavelets) are peaked at zero. A dead-zone quantizer uses a larger bin around 0: any value in [−δ, δ] maps to 0. This encourages sparsity — many zeros in the coefficient stream make it highly compressible by entropy coding. The dead-zone width δ is a design parameter that trades off between distortion and compressibility.

Log quantizers: Some applications care about relative error |ε|/|x| rather than absolute error. For these, quantize in log domain:

Q(x) = exp(Q_uniform(ln x)) for x > 0

This gives roughly constant relative error across the dynamic range. It connects to floating-point and block floating-point formats used in some ML activation quantizers. The mantissa provides uniform quantization in log space.

Connection to floating point: IEEE FP16 is essentially a log quantizer. The exponent selects a scale (power of 2), and the mantissa provides uniform quantization within that scale. The result: relative precision is roughly constant across 5 orders of magnitude. FP16 has "non-uniform" step sizes by design — tiny steps near zero, huge steps near the maximum representable value. This is companding built into the number format itself.

Block floating point: Modern ML quantization (e.g., Microsoft's MSFP8) shares one exponent across a block of values (32 or 128 consecutive weights). This is a compromise between per-value floating point (expensive) and fixed-point (single scale for everything). The shared exponent acts as a per-block companding operation.

Design pattern: All practical non-uniform quantizers follow the same meta-recipe, regardless of the specific application:

1. Observe

Measure or model the signal statistics (pdf or histogram)

↓

2. Design

Choose compressor g (or directly compute levels via Lloyd-Max)

↓

3. Quantize

Apply the designed quantizer (either via companding or direct lookup)

↓

4. Optionally code

Entropy code the indices if variable rate is acceptable

The theory from Chapters 3-6 tells us how to do Step 2 optimally. The rest is engineering.

μ-law Compressor

The compressor g(x) for different values of μ. Higher μ = more compression near zero = more non-uniform quantization. At μ=0, you get a straight line (uniform quantizer).

μ 255

What does the compressor g(x) in a companding system do?

It reduces the number of quantization levels needed It warps the signal axis so that a subsequent uniform quantizer produces non-uniform bins in the original domain It removes noise from the signal before quantization

Chapter 8: NormalFloat & LLM Quantization

Here's where 1982 meets 2023. Everything we've learned — Lloyd-Max conditions, optimal centroid placement for Gaussian distributions — directly enables the compression of billion-parameter language models.

The Weight Distribution of LLMs

When you train a large language model (GPT, LLaMA, Mistral), the resulting weight matrices have a striking statistical property: their values are approximately normally distributed. Centered near zero, bell-shaped, symmetric. This has been observed repeatedly across architectures and scales.

This means: if you want to quantize LLM weights to fewer bits, the optimal quantizer for Gaussian data — exactly what Lloyd-Max gives us — is the right tool.

NF4: NormalFloat 4-bit

NF4 (Dettmers et al., 2023) is a 4-bit quantization format with 16 non-uniform levels. The levels are chosen to be Lloyd-Max optimal for the standard normal distribution. Since LLM weights are approximately Gaussian (after per-channel normalization), NF4 minimizes quantization error for the actual weight statistics.

The direct connection: NF4 = Lloyd-Max quantizer for N(0,1) with M=16. The theory from Lloyd (1982) tells us exactly where to place 16 reconstruction levels to minimize MSE for Gaussian data. That's what NF4 does. A 40-year-old algorithm from Bell Labs now compresses 70B-parameter language models.

How NF4 Differs from Uniform 4-bit

A uniform 4-bit quantizer spreads 16 levels evenly across the range. For Gaussian weights, this wastes levels in the tails (where weights rarely go) and under-resolves near zero (where most weights live).

NF4 instead places levels according to the Lloyd-Max solution: denser near zero, sparser in the tails. The improvement is significant: Dettmers et al. report ~16.6% lower weighted quantization error compared to uniform 4-bit.

Property	Uniform 4-bit	NF4
Number of levels	16	16
Bin widths	Equal (0.133)	Variable (denser near 0)
Optimal for	Uniform distribution	Gaussian distribution
Avg. weighted error	0.0333	0.0278
Used in	Basic PTQ	QLoRA, GPTQ-variants

QLoRA: Putting It All Together

QLoRA (Quantized Low-Rank Adaptation) uses NF4 to compress the base model weights to 4 bits, then fine-tunes only small low-rank adapters in full precision. The key enabling insight: since the base weights are Gaussian, NF4 loses almost no information compared to the theoretical minimum error for 4-bit quantization. The model's capabilities are preserved despite 4x memory reduction.

The pipeline:

Normalize

Per-channel: ŵ = w/σ_channel (make weights ~ N(0,1))

↓

Quantize (NF4)

Map each normalized weight to nearest of 16 Lloyd-Max optimal levels

↓

Fine-tune (LoRA)

Train low-rank adapters at full precision on task data

Why 16.6% Improvement Matters at Scale

A 16.6% reduction in quantization error might sound modest. But consider what it means at the scale of LLMs:

• A 70B-parameter model has ~70 billion weights. At 4 bits each, that's ~35 GB. The quantization error is accumulated across billions of multiply-add operations during inference.

• Lower per-weight error means the quantized model's outputs stay closer to the full-precision model. In practice, this translates to measurably better perplexity: QLoRA with NF4 matches full 16-bit fine-tuning quality on many benchmarks.

• The cost of NF4 over uniform is zero at inference time — it's just a different lookup table (16 entries). The theoretical benefit is free in practice.

The engineering insight: NF4 isn't "better hardware" or "more compute." It's purely a better allocation of 4 bits, guided by a theory from 1982. The improvement comes from matching the quantizer to the data distribution — exactly what Lloyd and Max told us to do 40+ years ago.

Visualizing LLM Weight Distributions

Pilanci shows weight histograms from Nvidia's Hymba model (November 2024). The pattern is consistent: layer after layer, the weights form near-perfect bell curves. Some layers have slightly heavier tails (attention heads that specialize in rare tokens), but the Gaussian assumption holds remarkably well across architectures — GPT, LLaMA, Mistral, Hymba, Phi.

This universality isn't an accident. It's a consequence of how neural networks are initialized (Gaussian/He/Xavier init) combined with the regularizing effect of training (weight decay pushes weights toward zero, maintaining the bell shape). The central limit theorem also plays a role: each weight is the result of many small gradient updates, which sum to an approximately Gaussian distribution.

Beyond NF4: The Broader Landscape

The principle "match quantizer to data distribution" appears throughout modern ML quantization:

Per-channel scales: Different Δ per output channel (each channel has its own variance).

Dead-zone quantizers: Larger bin around zero for activations with many exact zeros (ReLU outputs).

Log quantizers: For values with heavy tails or when relative error matters more than absolute error.

Learned codebooks: Vector quantization where the codebook is trained on actual weight/activation statistics (LLM.int8(), AQLM, QuIP).

The Quantization Stack in Modern LLMs

A modern LLM quantization pipeline typically combines multiple techniques from this lecture:

Component	Technique	Connection to this lecture
Weight quantization	NF4 or FP4	Lloyd-Max for Gaussian (Chapter 5)
Per-channel scaling	absmax or percentile	Optimized uniform with clipping (Chapter 2)
Activation quantization	INT8 symmetric	Uniform quantizer (Lecture 3)
Outlier handling	Mixed precision	Dead-zone/clipping (Chapter 7)
Fine-tuning adapters	LoRA at FP16	Keep residual capacity in full precision
Double quantization	Quantize the scaling factors	Recursive quantization (same theory, meta-level)

The remarkable fact: the entire theoretical framework for this pipeline — Lloyd-Max conditions, high-rate theory, companding — was developed between 1948 and 1982 for telephone systems. The application to 70B-parameter neural networks was unanticipated, but the mathematics transfers perfectly.

Practical Impact: Memory and Speed

The numbers are dramatic. Consider quantizing LLaMA-2 70B from FP16 to NF4:

Format	Bits/weight	Model size	GPU memory	Quality loss
FP16	16	140 GB	~140 GB (2×A100)	None (baseline)
INT8	8	70 GB	~70 GB (1×A100)	Negligible
NF4	4	35 GB	~35 GB (1×A6000)	Minimal (<0.5 ppl)
NF4 + LoRA	4 + sparse 16	~37 GB	~40 GB (fine-tuning)	Recovered via LoRA

NF4 enables running a 70B model on a single consumer GPU (48GB VRAM). The theoretical guarantee — Lloyd-Max minimizes MSE for Gaussian data — is what makes this work without catastrophic quality loss. You're losing the minimum possible information for 4 bits.

To put this in perspective: before QLoRA (2023), fine-tuning a 65B model required multiple 80GB A100 GPUs costing tens of thousands of dollars per month. After QLoRA with NF4, the same model could be fine-tuned on a single 48GB A6000 (~$1000 consumer card). The insight from Lloyd (1982) — "place your quantization levels at the conditional means of the cells" — democratized access to frontier-scale language models. Theory has practical consequences.

Uniform vs. NF4 for Gaussian Weights

A Gaussian weight distribution with uniform (red) vs. NF4 (green) quantization levels. Notice how NF4 concentrates levels near zero where the density is highest.

Click to compare quantization approaches.

Why is NF4 a good quantization format for LLM weights?

LLM weights are approximately Gaussian, and NF4 uses Lloyd-Max optimal levels for N(0,1), minimizing quantization error for that distribution NF4 uses more bits than other formats NF4 is faster to compute than uniform quantization

Chapter 9: Mastery

Let's consolidate everything. This lecture covered the full arc from "uniform is suboptimal" to "here's how to compress a 70B-parameter LLM." The theoretical backbone is simple: two necessary conditions, an iterative algorithm, and an asymptotic density formula.

Convergence of Lloyd-Max

Why does the algorithm converge? At each iteration:

Boundary update (C2): Given fixed {y_k}, setting t_k = (y_k−1+y_k)/2 minimizes D over thresholds (nearest-neighbor is optimal for fixed codepoints).

Centroid update (C1): Given fixed {t_k}, setting y_k = E[X|X∈cell k] minimizes D over levels (centroid minimizes MSE in each cell independently).

Each step doesn't increase D. Since D ≥ 0, the sequence D⁽⁰⁾ ≥ D⁽¹⁾ ≥ ... ≥ 0 is monotone bounded and must converge. The limit satisfies both C1 and C2 simultaneously (a fixed point of the alternating map).

Rate of Convergence

For log-concave distributions, Lloyd-Max converges linearly (geometric rate): each iteration reduces the gap to the optimum by a constant factor c < 1. Empirically, for Gaussian with M=8:

Iteration	MSE	\|D⁽ⁿ⁾ − D^*\|
0 (init)	0.0520	1.85 × 10⁻²
1	0.0382	4.7 × 10⁻³
2	0.0347	1.2 × 10⁻³
3	0.0338	3.0 × 10⁻⁴
5	0.03352	1.9 × 10⁻⁵
10	0.03350	< 10⁻⁸

The error decreases by roughly 4x per iteration — very fast convergence. In practice, 10 iterations are more than enough for any reasonable application. This is why Lloyd-Max is practical: it's cheap (a few integrals per iteration) and fast to converge.

Global Optimality for Log-Concave Densities

A density f_X is log-concave if log f_X(x) is a concave function. Examples:

• Gaussian: log f = −x²/2 + const (concave parabola)

• Laplacian: log f = −|x|/b + const (concave V-shape)

• Uniform: log f = const (trivially concave)

• Logistic: log f is concave (can verify by checking second derivative)

For log-concave f_X, the Lloyd-Max conditions have a unique solution. There is exactly one quantizer satisfying C1+C2, and it is the global optimum. Any starting point converges to this unique solution. This is a deep result: log-concavity ensures that the distortion landscape, while not convex in general, has no spurious local minima.

Counter-example: a bimodal distribution (e.g., mixture of two Gaussians separated by a gap) is NOT log-concave. For such distributions, a 2-level quantizer might have two locally optimal solutions: (a) one level per mode, or (b) both levels in one mode if the modes have very different heights.

Why This All Matters: The Engineering Perspective

You might ask: if Lloyd-Max only gives 3-14% improvement over uniform for typical distributions, why bother? Three reasons:

1. Scale magnifies small improvements. A 3% MSE reduction on a Gaussian sounds modest. But when quantizing 70 billion weights in an LLM, that 3% translates to measurably better perplexity on downstream tasks. At scale, even small improvements in quantization fidelity compound across billions of operations.

2. The cost is zero at inference. Once you've computed the optimal levels offline, using them is no more expensive than using uniform levels. It's just a different lookup table. Free performance.

3. The improvement grows for peakier distributions. Gaussian is actually the "easy" case. For Laplacian sources (speech, DCT coefficients), the improvement is 11%. For empirical weight distributions with outliers or heavy tails, it can be 15-20%. The worse the match between the data and a uniform distribution, the more Lloyd-Max helps.

The meta-lesson: The entire framework of this lecture — Lloyd-Max, high-rate theory, companding — is about matching the quantizer to the statistics of the data. This principle recurs everywhere in engineering: matched filters, maximum likelihood, sufficient statistics. When your tool matches your data, you waste nothing.

The Complete Picture

Method	How it works	Optimal for	Use case
Uniform	Equal-width bins, midpoint levels	Uniform distribution	General-purpose, simple HW
μ-law	Log compressor + uniform Q	Approx. speech/audio	Telephony (8-bit PCM)
Lloyd-Max	Alternating C1+C2 until convergence	Any known f_X (exact for that M)	Offline design for fixed pdf
High-rate compander	λ^☆ ∝ f_X^1/3, build g	Asymptotically optimal	Good init for Lloyd-Max
NF4	Lloyd-Max for N(0,1), M=16	Gaussian weights	LLM compression (QLoRA)

Design Challenge: Laplacian with M=8

Implement Lloyd-Max for a unit-variance Laplacian source with M=8 levels. Here's the skeleton:

python
import numpy as np
from scipy.integrate import quad
from scipy.stats import laplace

# Unit-variance Laplacian: scale = 1/sqrt(2)
dist = laplace(loc=0, scale=1/np.sqrt(2))

def lloyd_max(M, dist, max_iter=100, tol=1e-8):
    # Initialize: uniform spacing over [-3sigma, 3sigma]
    sigma = dist.std()
    y = np.linspace(-3*sigma, 3*sigma, M)

    for it in range(max_iter):
        # C2: boundary update
        t = np.empty(M + 1)
        t[0], t[-1] = -np.inf, np.inf
        for k in range(1, M):
            t[k] = (y[k-1] + y[k]) / 2

        # C1: centroid update
        y_new = np.empty(M)
        for k in range(M):
            num, _ = quad(lambda x: x * dist.pdf(x), t[k], t[k+1])
            den, _ = quad(dist.pdf, t[k], t[k+1])
            y_new[k] = num / den if den > 0 else (t[k]+t[k+1])/2

        if np.max(np.abs(y_new - y)) < tol:
            break
        y = y_new

    return t, y

t, y = lloyd_max(8, dist)
print("Thresholds:", t[1:-1].round(4))
print("Levels:", y.round(4))

Running this produces the table from Chapter 5. Try modifying it for different distributions (Gaussian, bimodal mixture) to see how the levels adapt.

Extension: The Sample-Based Version (1D k-means)

If you don't know f_X analytically but have N samples, replace integrals with sums:

python
import numpy as np

def lloyd_max_empirical(data, M, max_iter=100, tol=1e-8):
    # This IS 1D k-means
    n = len(data)
    # Initialize: equally spaced over data range
    y = np.linspace(data.min(), data.max(), M)

    for it in range(max_iter):
        # Assignment (C2): assign each sample to nearest level
        dists = np.abs(data[:, None] - y[None, :])  # (N, M)
        assignments = np.argmin(dists, axis=1)

        # Update (C1): each level = mean of assigned samples
        y_new = np.array([
            data[assignments == k].mean() if np.any(assignments == k)
            else y[k]
            for k in range(M)
        ])

        if np.max(np.abs(y_new - y)) < tol:
            break
        y = y_new

    mse = np.mean((data - y[assignments])**2)
    return y, mse

# Example: quantize 10000 Gaussian samples
data = np.random.randn(10000)
levels, mse = lloyd_max_empirical(data, M=8)
print(f"Levels: {levels.round(3)}")
print(f"MSE: {mse:.5f}")

With 10,000 samples, this typically gives levels within 0.01 of the true Lloyd-Max solution. The sample-based approach is crucial for quantizing neural network weights, where the "distribution" is just the empirical histogram of weight values.

Important note: for the empirical version, the centroid update might produce an empty cluster (no samples assigned). Handle this by either keeping the old level, reinitializing from a random sample, or splitting the largest cluster. This is the "dead cluster" problem in k-means, and the same solutions apply.

Computational complexity per iteration: O(n · M) for the assignment step (each of n samples compared to M levels), O(n) for the centroid update (one pass through data). For n = 10⁶ and M = 16, this is about 16 million comparisons per iteration — milliseconds on modern hardware. The algorithm is extremely practical.

Verification: Hand-Computing One Iteration

Let's trace one iteration by hand for M=4 on N(0,1) with initial levels y = [−1.5, −0.5, 0.5, 1.5]:

C2 (boundaries): t₁ = (−1.5+−0.5)/2 = −1.0, t₂ = (−0.5+0.5)/2 = 0, t₃ = (0.5+1.5)/2 = 1.0.

C1 (centroids): For cell 0, [−∞, −1.0]: y₀ = E[X | X < −1] = −φ(1)/Φ(−1) ≈ −0.2420/0.1587 ≈ −1.525. For cell 1, [−1.0, 0]: y₁ = E[X | −1≤X<0] ≈ −0.460. For cell 2, [0, 1.0]: y₂ ≈ 0.460. For cell 3, [1.0, ∞]: y₃ ≈ 1.525.

After one iteration, levels moved from [−1.5, −0.5, 0.5, 1.5] to [−1.525, −0.460, 0.460, 1.525]. The inner levels moved inward (from ±0.5 to ±0.46), concentrating more near zero. A few more iterations and we converge to the optimal [−1.510, −0.453, 0.453, 1.510].

What's happening geometrically? The Gaussian pdf has more mass near zero than near ±1.5. The centroid of [0, 1.0] is pulled below the cell midpoint (toward zero, where more mass is). Similarly, the centroid of [−1.0, 0] is pulled above its midpoint. The outer levels barely move because their cells extend to infinity — the long tail pulls the centroid outward, nearly balancing the inward pull from the cell boundary.

What Convergence Looks Like

In the simulation above, notice the MSE plot. It typically shows:

• Fast initial drop: The first 2-3 iterations capture most of the improvement.

• Exponential tail: Subsequent iterations refine at an exponentially decreasing rate.

• Monotone decrease: MSE never increases (guaranteed by the alternating minimization structure).

• Initialization sensitivity: Try "Random Init" multiple times. For Gaussian, all initializations converge to the same final MSE (uniqueness). For multi-modal distributions, different starts would give different answers.

Distortion Comparison: Three Design Methods

Pilanci compares three constructive designs for the same bit budget:

Method	MSE (3-bit Gaussian)	MSE (3-bit Laplacian)	Complexity
(A) Optimized Uniform	0.0345	0.0520	1D search over X_max
(B) Lloyd-Max	0.0335	0.0465	Iterative (converges in ~10 steps)
(C) High-rate Compander	0.0338	0.0478	Analytic formula

Key observations: (1) Lloyd-Max always wins (it's the true optimum). (2) The high-rate compander is nearly as good and requires no iteration. (3) Using (C) as initialization for (B) gives fast convergence to the global optimum. (4) The gap between uniform and non-uniform grows for peakier distributions.

From Theory to Practice: The Design Pipeline

In practice, the recommended approach is:

Step 1

Compute λ^☆(x) ∝ f_X(x)^1/3 and build compressor g(x)

↓

Step 2

Get initial thresholds t_k = g⁻¹(k/M) and levels y_k = g⁻¹((k+0.5)/M)

↓

Step 3

Refine with a few Lloyd-Max iterations (typically 5–10 suffice)

done

This combines the analytic speed of high-rate theory with the exactness of Lloyd-Max. The compander initialization ensures we start near the global optimum, so even for non-log-concave distributions, we're unlikely to get trapped in a bad local minimum.

Key Takeaways

What you now know:
• Uniform quantization is optimal only for uniform distributions
• Lloyd-Max conditions: boundaries are midpoints (C2), levels are centroids (C1)
• Lloyd-Max algorithm = alternating minimization = 1D k-means
• High-rate optimal density: λ^☆(x) ∝ f_X(x)^1/3
• Companding implements non-uniform Q via compress+uniform+expand
• NF4 = Lloyd-Max for Gaussian = how we compress LLMs today

Looking Ahead: Vector Quantization

Everything in this lecture deals with scalar quantization — quantizing one number at a time. But real signals are multidimensional: an image patch is a vector in R⁶⁴, a weight matrix row is a vector in R^d. Vector quantization (VQ) generalizes Lloyd-Max to higher dimensions:

• Thresholds become Voronoi cells in R^d (complex polytopes, not intervals)

• Reconstruction levels become a codebook {y₁, ..., y_M} in R^d

• Lloyd's algorithm (k-means) generalizes directly: assign to nearest codebook vector, recompute centroids

• VQ can exploit correlations between dimensions that scalar quantization misses

VQ is the foundation of product quantization (used in FAISS for similarity search), VQ-VAE (discrete representation learning), and learned codebook quantization in modern ML systems. The theory from this lecture — alternating centroid and assignment updates — remains the core algorithm.

The Historical Arc

Let's trace the intellectual lineage:

Year	Contribution	Researcher(s)
1948	High-rate distortion analysis	W.R. Bennett (Bell Labs)
1951	Optimal point density formula	P.F. Panter & W. Dite
1957	Lloyd's algorithm (internal Bell Labs memo)	S. Lloyd
1960	Optimality conditions published	J. Max
1963	μ-law standardized for telephony	ITU-T
1982	Lloyd's algorithm finally published	S. Lloyd (IEEE)
1982	Rigorous Zador formula	P. Zador
2023	NF4 for LLM quantization	T. Dettmers et al.

Stuart Lloyd developed his algorithm at Bell Labs in 1957, but it wasn't published until 1982 — a 25-year gap! In the meantime, Joel Max independently derived the same conditions in 1960. The algorithm is therefore called "Lloyd-Max" in the quantization community, though in the clustering community it's simply "Lloyd's algorithm" or "k-means."

The 65-year arc from Bennett (1948) to Dettmers (2023) shows how fundamental theory in signal processing finds unexpected applications in machine learning. The mathematics doesn't expire.

Pilanci's message: "The connection between Lloyd-Max quantization and modern LLM compression is not a superficial analogy — it's the same mathematical object. NF4 IS Lloyd-Max. The theory that telecommunications engineers developed to make phone calls clearer now lets us fit 70-billion-parameter language models on consumer hardware."

References

1. S. Lloyd, "Least Squares Quantization in PCM," IEEE Trans. Info. Theory, 1982.
2. J. Max, "Quantizing for Minimum Distortion," IRE Trans. Info. Theory, 1960.
3. W.R. Bennett, "Spectra of Quantized Signals," Bell System Technical Journal, 1948.
4. P.F. Panter & W. Dite, "Quantization Distortion in PCM with Nonuniform Spacing of Levels," 1951.
5. T. Dettmers et al., "QLoRA: Efficient Finetuning of Quantized Language Models," NeurIPS, 2023.
6. A. Gersho & R.M. Gray, Vector Quantization and Signal Compression, Springer, 1992.

A practitioner wants to quantize neural network weights that follow a Laplacian distribution to 4 bits. What should they do?

Use uniform 4-bit quantization since it's simplest Use NF4 since it's designed for neural networks Run Lloyd-Max on the Laplacian distribution with M=16 to get optimal levels for their actual weight statistics

Lloyd-Max OptimalQuantizers