Every digital system destroys information. The question is: how much, and can we model the destruction as simple additive noise?
You record a perfect sine wave — a pure tone at 440 Hz. You store it digitally. When you play it back, it doesn't sound quite the same. There's a faint hiss, a subtle graininess. Where did that noise come from? You didn't add it. No microphone was involved. The noise was created by the act of digitization itself.
This is quantization noise: the unavoidable error introduced when you round a continuous value to the nearest discrete level. Every A/D converter, every digital image, every neural network weight stored in INT8 — they all suffer from this fundamental distortion.
The beautiful result of this lecture: under certain conditions, we can model quantization as if someone simply added white noise to the signal. This additive noise model turns an ugly nonlinear operation into something we can analyze with standard linear tools.
A smooth cosine signal (teal) gets mapped to discrete levels (orange staircase). The difference is the quantization error ε shown below.
Look at the error waveform at the bottom. With few bits, the error is large and clearly correlated with the signal — it looks structured, not random. As you increase the bits, the error becomes smaller and starts to look like noise. This visual transition is the intuition behind Bennett's Theorem, which we'll prove in Chapter 6.
A uniform quantizer divides the real line into equal-width bins and maps every value within a bin to the bin's midpoint. Think of it as a staircase function: the input slides smoothly, but the output jumps in discrete steps.
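The quantizer is completely described by two parameters: the step size Δ (the width of each bin) and the number of levels M (M = 2^b for a b-bit quantizer).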
The mapping q(x) works like this: find which bin x falls into, then snap to the midpoint of that bin. For a midrise quantizer with M levels:
q(x) = Δ · (⌊x/Δ⌋ + 1/2) for −MΔ/2 ≤ x < MΔ/2
The no-overload range (also called the granular region) is the interval where the quantizer operates normally:
x ∈ [−MΔ/2, MΔ/2]
Within this range, the quantization error is bounded:
|ε| = |x − q(x)| ≤ Δ/2
This is a hard guarantee: if the signal stays within the no-overload range, the maximum possible error is half a step size. Outside this range, the quantizer saturates (clips), and the error can be arbitrarily large.
The staircase maps any input x to the nearest quantization level q(x). Orange region = no-overload range. Beyond it, the quantizer clips.
No-overload regime: the input stays within [−MΔ/2, MΔ/2], the error is bounded by Δ/2, and the quantizer behaves predictably.
Overload regime: the input exceeds the range, the output clips at the extreme level, and the error grows with signal amplitude. This is saturation distortion.
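The staircase and its two regimes can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `midrise_quantize`, the step size 0.25, and the choice of 8 levels are assumptions made for the example, and the number of levels is assumed to be even.

```python
import math

def midrise_quantize(x, delta, levels):
    """Uniform midrise quantizer with `levels` output levels (assumed even)."""
    # Index of the bin [k*delta, (k+1)*delta) that contains x.
    k = math.floor(x / delta)
    # Saturate to the outermost bins so that out-of-range inputs clip.
    k = max(-(levels // 2), min(levels // 2 - 1, k))
    # Snap to the midpoint of the selected bin.
    return (k + 0.5) * delta

delta, levels = 0.25, 8  # hypothetical values; no-overload range is [-1.0, 1.0]
for x in (0.10, -0.30, 0.99, 1.80):
    q = midrise_quantize(x, delta, levels)
    print(f"x={x:+.2f}  q(x)={q:+.3f}  error={x - q:+.3f}")
```

The first three inputs lie inside the no-overload range, so their errors stay within Δ/2 = 0.125. The last input lies outside it, so the output sticks at the top level and the error exceeds Δ/2.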
Here is the dream: treat the quantization error ε as if it were a random variable, independent of the signal, uniformly distributed on [−Δ/2, Δ/2]. If we could do this, the entire nonlinear quantization operation reduces to a trivial additive noise channel:
This would give us a complete statistical characterization for free. But can we actually do this? Under what conditions?
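Pilanci states two assumptions explicitly: (A1) each error sample ε_n is uniformly distributed on [−Δ/2, Δ/2], and (A2) the error samples are uncorrelated with one another, so the error sequence behaves like white noise.
Bennett's Theorem (1948) tells us when the dream holds. As M → ∞ and Δ → 0, assumptions (A1) and (A2) hold if the signal stays within the no-overload range and its PDF is smooth.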
The intuition: when the step size is much smaller than the scale of variation of the signal's PDF, each quantization bin sees an approximately uniform slice of the distribution. The error within each bin becomes approximately uniform, and across bins it becomes approximately uncorrelated.
Compare the histogram of actual quantization errors (bars) versus the ideal uniform distribution (dashed line). With few bits, the mismatch is severe. With many bits and a smooth input, they converge.
The variance formula Δ²/12 comes directly from the uniform distribution: if X ~ Uniform[−a, a], then Var[X] = (2a)²/12 = a²/3. With a = Δ/2, we get Var = (Δ/2)²/3 = Δ²/12.
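A quick numerical check of the Δ²/12 figure, as a sketch under stated assumptions: the input is taken to be Gaussian with unit variance, the step size 0.05 is an arbitrary small value, and clipping is ignored. The helper name `quantization_errors` is made up for this example.

```python
import math
import random

def quantization_errors(samples, delta):
    """Return x - q(x) for an unclipped midrise quantizer with step `delta`."""
    return [x - (math.floor(x / delta) + 0.5) * delta for x in samples]

random.seed(0)
delta = 0.05  # assumed step size, small compared with the signal's spread
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
errors = quantization_errors(samples, delta)

mean = sum(errors) / len(errors)
variance = sum((e - mean) ** 2 for e in errors) / len(errors)
print(f"empirical variance: {variance:.3e}")
print(f"delta**2 / 12:      {delta ** 2 / 12:.3e}")
```

With a step this small relative to the signal, the two printed numbers should agree closely.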
Now we have all the pieces. If ε ~ Uniform[−Δ/2, Δ/2], then the noise power is:
σ_ε² = E[ε²] = Δ²/12
The Signal-to-Quantization-Noise Ratio (SQNR) measures how much signal power we have relative to the quantization noise power. Writing σ² for the signal power:
SQNR = σ²/σ_ε² = 12σ²/Δ²
In decibels:
SQNR_dB = 10 log10(12σ²/Δ²)
Now here's the key manipulation. If we set the quantizer range to cover ±Kσ of the signal (a common design choice), then Δ = 2Kσ/2^b. Substituting:
SQNR_dB = 10 log10(3 · 2^(2b)/K²) ≈ 6.02b + 4.77 − 20 log10(K)
The 6 dB rule: each additional bit of resolution improves the SQNR by approximately 6.02 dB. Halving the step size (which is what adding one bit does) reduces the noise power by a factor of 4, which is 10 log10(4) ≈ 6.02 dB.
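The rule can be checked directly from the definitions in this section. The sketch below assumes the noise power is Δ²/12 and the range covers ±Kσ with 2^b levels, so that Δ = 2Kσ/2^b and σ cancels out of the ratio. The function name `sqnr_db` and the choice K = 4 are illustrative.

```python
import math

def sqnr_db(bits, k):
    """SQNR in dB when 2**bits levels cover +/- k standard deviations."""
    # sigma^2 / (delta^2 / 12) with delta = 2*k*sigma / 2**bits
    # simplifies to 3 * 4**bits / k**2; sigma cancels.
    return 10 * math.log10(3 * 4 ** bits / k ** 2)

for bits in (8, 9, 10):
    print(bits, round(sqnr_db(bits, k=4.0), 2))

# Gain per extra bit, independent of k.
print(round(sqnr_db(9, 4.0) - sqnr_db(8, 4.0), 2))
```

Changing K shifts every value by the same constant, which is why the per-bit improvement does not depend on how the range is chosen.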
| Bits (b) | SQNR (dB) | Quality | Standard |
|---|---|---|---|
| 4 | ~25 | Severe distortion | — |
| 8 | ~50 | Telephone quality | PCM telephony |
| 12 | ~74 | FM radio quality | Early digital audio |
| 16 | ~98 | CD quality | CD / WAV |
| 24 | ~146 | Studio / transparent | Professional audio |
Adjust the bit depth. Watch the SQNR climb by ~6 dB per bit. The top shows original vs quantized signal; the bottom shows the error magnified. The meter on the right shows quality level.
Observe that when you go from 8 bits to 16 bits (adding 8 bits), the SQNR jumps by about 48 dB. That matches 8 × 6.02 ≈ 48.2 dB, so the rule is remarkably precise.
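The table values can be reproduced approximately by simulation. The sketch below makes several assumptions not stated in the text: the test signal is a full-scale sinusoid on [−1, 1], it is sampled at random phases (to avoid locking sample times to bin edges), and the comparison line uses the familiar full-scale-sinusoid form 6.02b + 1.76 dB, which corresponds to K = √2 in the ±Kσ formulation.

```python
import math
import random

def measured_sqnr_db(bits, n=50_000, seed=0):
    """Quantize a full-scale sinusoid and measure signal/noise power in dB."""
    rng = random.Random(seed)
    delta = 2.0 / 2 ** bits          # range [-1, 1] split into 2**bits bins
    k_max = 2 ** (bits - 1) - 1      # index of the top bin
    signal_power = noise_power = 0.0
    for _ in range(n):
        x = math.sin(2 * math.pi * rng.random())
        # Midrise quantizer with saturation at the outermost bins.
        k = max(-k_max - 1, min(k_max, math.floor(x / delta)))
        error = x - (k + 0.5) * delta
        signal_power += x * x
        noise_power += error * error
    return 10 * math.log10(signal_power / noise_power)

for bits in (4, 8, 12, 16):
    print(f"{bits:2d} bits: measured {measured_sqnr_db(bits):6.2f} dB, "
          f"rule of thumb {6.02 * bits + 1.76:6.2f} dB")
```

The measured values should track the rule of thumb closely at 8 bits and above, and the gap between the 8-bit and 16-bit rows should come out near 48 dB.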
The 6 dB rule isn't just theory — it literally shaped the engineering of every digital media standard you use daily. Engineers picked bit depths by asking: "How many dB of SQNR does the application need?"
The human ear has a dynamic range of about 120-140 dB (from the threshold of hearing to the threshold of pain). CD audio at 16 bits gives ~98 dB of SQNR — not enough for the full range, but enough that the quantization noise sits below the noise floor of typical playback environments.
Professional studios use 24 bits (~146 dB) not because humans can hear that range, but because it provides headroom for mixing. When you add, filter, and process signals, you accumulate quantization errors. Starting with more bits means the accumulated errors stay inaudible.
Fixed bit depth (lossless): every sample gets the same number of bits. Examples are WAV, FLAC, and AIFF. FLAC uses entropy coding to compress further without losing any bits. None of these adds quantization beyond the original A/D conversion.
Perceptual (lossy) codecs: MP3 and AAC (the audio codec typically carried in MP4 files). These adapt the bits per frequency band based on psychoacoustic masking. Quiet components near loud ones are quantized more coarsely; you won't hear the noise because the loud sound masks it.
For images, 8 bits per channel gives roughly 48 dB of SQNR by the 6 dB rule (8 × 6.02). This turns out to be close to the point where human vision saturates: we cannot distinguish more than about 200-250 gray levels in a single image under normal viewing conditions. This is why JPEG, PNG, and most display standards default to 8 bits per channel.
HDR imaging (10-12 bits per channel) isn't about seeing finer gradations — it's about covering a wider dynamic range (bright highlights + dark shadows in the same image) without saturation clipping.
Modern ML uses exactly this theory when quantizing neural network weights from FP32 (23-bit mantissa) down to INT8 (8 bits) or even INT4 (4 bits). The 6 dB rule tells you: going from FP32 to INT8 costs you roughly (23 − 8) × 6 = 90 dB of SQNR on the weights. The empirical question is whether the network can tolerate that level of weight noise without losing accuracy.
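To make the weight-quantization picture concrete, here is a minimal sketch of symmetric per-tensor quantization followed by dequantization. Several things are assumptions for illustration only: the "layer" is 50,000 synthetic Gaussian weights with standard deviation 0.02, the scale is set from the largest absolute weight, rounding is to the nearest integer code (a midtread rather than midrise staircase), and Python floats stand in for FP32. Real frameworks use their own calibration schemes.

```python
import math
import random

def quantize_symmetric(weights, bits):
    """Symmetric per-tensor quantization to signed integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax   # step size (delta)
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [c * scale for c in codes], scale

def measure_sqnr_db(reference, approx):
    """Ratio of weight power to quantization-error power, in dB."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    return 10 * math.log10(signal / noise)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]  # hypothetical layer
for bits in (8, 4):
    dequantized, scale = quantize_symmetric(weights, bits)
    print(f"INT{bits}: scale={scale:.2e}, "
          f"SQNR={measure_sqnr_db(weights, dequantized):.1f} dB")
```

Because the scale is tied to the largest weight, the absolute SQNR depends on how heavy the tails of the weight distribution are. The gap between the two bit widths follows the ratio of their step sizes.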
See how a signal looks at different bit depths simultaneously. The blue region shows the quantization error envelope.
Everything we've derived so far — the Δ²/12 noise, the 6 dB rule, Bennett's Theorem — assumes the signal stays in the no-overload range [−MΔ/2, MΔ/2]. What happens when it doesn't?
When the signal exceeds the quantizer's range, it clips. The output is stuck at the maximum (or minimum) level regardless of how large the input gets. The error is no longer bounded by Δ/2; it can be arbitrarily large:
|ε| = |x| − (M − 1)Δ/2 for |x| > MΔ/2
Clipping distortion sounds terrible — it generates harsh harmonics. In images, it produces blown-out white or crushed black regions where detail is irrecoverably lost. This is far worse than the gentle granular noise within the no-overload region.
Formally, the no-overload range has total width MΔ = 2^b Δ. If your signal has standard deviation σ, you want the range to cover ±Kσ for some K (typically K = 3 or 4 for Gaussian signals). This means:
2^b Δ = 2Kσ, so Δ = 2Kσ/2^b
Larger K means less clipping but coarser quantization. Smaller K means finer quantization but more clipping. In practice, K = 4 captures 99.99% of a Gaussian signal's samples.
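The trade-off can be tabulated with two small helpers. This sketch assumes a zero-mean Gaussian input, for which the probability of landing outside ±Kσ is erfc(K/√2). The function names and the 8-bit setting are illustrative.

```python
import math

def step_size(sigma, k, bits):
    """Step size when 2**bits levels span [-k*sigma, +k*sigma]."""
    return 2 * k * sigma / 2 ** bits

def gaussian_overload_probability(k):
    """P(|x| > k*sigma) for a zero-mean Gaussian input."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3, 4):
    print(f"K={k}: delta={step_size(1.0, k, bits=8):.5f}, "
          f"P(clip)={gaussian_overload_probability(k):.2e}")
```

Moving from K = 3 to K = 4 makes the step size one third larger while cutting the clipping probability by well over an order of magnitude.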
AGC dynamically adjusts the signal amplitude before quantization to keep it within the no-overload range. Your phone does this constantly during calls: when you speak loudly, it turns down the gain; when you whisper, it turns up the gain. This maximizes the use of available quantization levels without clipping.
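A toy version of this idea, as a sketch only: a feed-forward gain control that tracks a running power estimate and rescales each sample toward a target RMS level. The function name `apply_agc`, the target of 0.25, and the smoothing constant 0.99 are assumptions for illustration; production AGC loops use separate attack and release behavior and are considerably more careful.

```python
import math

def apply_agc(samples, target_rms=0.25, smoothing=0.99, floor=1e-6):
    """Feed-forward AGC: normalize by a running estimate of signal power."""
    power = target_rms ** 2  # start at the target so the initial gain is about 1
    out = []
    for x in samples:
        # Exponentially weighted running average of x**2.
        power = smoothing * power + (1 - smoothing) * x * x
        gain = target_rms / max(math.sqrt(power), floor)
        out.append(gain * x)
    return out

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

loud = [5.0 * math.sin(2 * math.pi * i / 20) for i in range(4000)]
quiet = [0.01 * math.sin(2 * math.pi * i / 20) for i in range(4000)]
print(rms(apply_agc(loud)[-1000:]), rms(apply_agc(quiet)[-1000:]))
```

After the estimate settles, both the loud and the quiet input end up near the same RMS level, so a quantizer with range [−1, 1] sees a comparable signal in both cases. During the first samples of the loud input the gain has not yet adapted, so a real system would still clip briefly.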
Increase the signal amplitude until it exceeds the quantizer's range. Watch the error spike at the clipping points (red regions).
We now prove why the quantization error becomes uniform as Δ → 0. The proof uses a clever decomposition of the CDF of the error conditioned on which quantization bin the signal falls into.
Let the quantization bins be indexed by k. The k-th bin spans the interval [x_k − Δ/2, x_k + Δ/2], where x_k is the bin center. The quantization error within this bin is:
ε = x − x_k
We want the CDF of ε, that is, P(ε ≤ e) for e ∈ [−Δ/2, Δ/2]. We condition on which bin the signal falls into:
P(ε ≤ e) = Σ_k P(ε ≤ e | x ∈ bin k) · P(x ∈ bin k)
Given that x is in bin k, the event ε = x − x_k ≤ e is equivalent to x ≤ x_k + e. So:
P(ε ≤ e | x ∈ bin k) = ∫_{x_k − Δ/2}^{x_k + e} f_x(u) du / ∫_{x_k − Δ/2}^{x_k + Δ/2} f_x(u) du
If the PDF f_x is approximately constant over the bin (this is the "smooth PDF" condition, valid when Δ is small relative to the scale of variation of f_x), then:
P(ε ≤ e | x ∈ bin k) ≈ f_x(x_k) · (e + Δ/2) / (f_x(x_k) · Δ) = (e + Δ/2)/Δ
This is the CDF of a Uniform[−Δ/2, Δ/2] distribution, and crucially it doesn't depend on k. So after summing over all bins (the weights P(x ∈ bin k) sum to 1), we get:
P(ε ≤ e) ≈ (e + Δ/2)/Δ for e ∈ [−Δ/2, Δ/2]
The proof of (A2) follows similarly. The error at time n depends on which bin x_n occupies. If x_n and x_m land in different bins (which they generically do for smooth random processes as Δ → 0), then the errors ε_n and ε_m are determined by independent "within-bin positions" and hence are uncorrelated.
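The whiteness claim can be probed numerically. The sketch below assumes a particular input model that is not in the text: a Gaussian AR(1) process with unit variance and correlation 0.9 between neighboring samples. It estimates the lag-1 correlation of the quantization errors for a coarse and a fine step size. The function name `error_correlation` is made up for this example.

```python
import math
import random

def error_correlation(delta, rho=0.9, n=50_000, seed=0):
    """Lag-1 correlation of quantization errors for a Gaussian AR(1) input."""
    rng = random.Random(seed)
    x, errors = 0.0, []
    for _ in range(n):
        # AR(1) recursion with unit stationary variance.
        x = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Error of an unclipped midrise quantizer with step `delta`.
        errors.append(x - (math.floor(x / delta) + 0.5) * delta)
    mean = sum(errors) / n
    num = sum((a - mean) * (b - mean) for a, b in zip(errors, errors[1:]))
    den = sum((e - mean) ** 2 for e in errors)
    return num / den

for delta in (4.0, 1.0, 0.05):
    print(f"delta={delta:<5} lag-1 error correlation={error_correlation(delta):+.3f}")
```

With a coarse step, neighboring samples often share a bin and the errors inherit the signal's correlation. With a fine step, the estimated correlation should fall to the level of sampling noise, even though the input itself is strongly correlated.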
Each bin slices the PDF. If the PDF is approximately flat within each bin (small Δ), the conditional position within the bin is approximately uniform. Shrink Δ to see convergence.
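The convergence to a uniform error distribution can also be measured. As a sketch under the assumption of a unit-variance Gaussian input with no clipping, the code below compares the empirical CDF of the error with the uniform CDF (e + Δ/2)/Δ and reports the largest gap, in the style of a Kolmogorov-Smirnov distance. The helper name `max_cdf_gap` is illustrative.

```python
import math
import random

def max_cdf_gap(samples, delta):
    """Largest gap between the empirical error CDF and the Uniform CDF."""
    errors = sorted(x - (math.floor(x / delta) + 0.5) * delta for x in samples)
    n = len(errors)
    # Compare the empirical CDF (i + 1) / n with (e + delta/2) / delta.
    return max(abs((i + 1) / n - (e + delta / 2) / delta)
               for i, e in enumerate(errors))

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(50_000)]
for delta in (4.0, 2.0, 1.0, 0.25):
    print(f"delta={delta:<5} max CDF gap={max_cdf_gap(samples, delta):.4f}")
```

When Δ is several times the signal's standard deviation, the PDF is far from flat across a bin and the gap is clearly visible. Once Δ is comparable to or smaller than the standard deviation, the remaining gap is mostly sampling noise from the finite number of samples.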
You now have the complete theory of quantization noise. Let's consolidate.
| Application | Bits | SQNR | Rationale |
|---|---|---|---|
| ML weights (aggressive) | 4 | ~25 dB | Networks surprisingly tolerant; retrain to compensate |
| ML inference (standard) | 8 | ~50 dB | INT8 widely supported on GPUs; minimal accuracy loss |
| Telephony | 8 | ~50 dB | Speech intelligibility preserved |
| FM broadcast | 12 | ~74 dB | Matches analog FM noise floor |
| Consumer audio (CD) | 16 | ~98 dB | Exceeds typical listening environment noise |
| Professional audio | 24 | ~146 dB | Headroom for mixing/processing chains |
| Images (per channel) | 8 | ~48 dB | Human vision saturates at ~250 levels |
| HDR imaging | 10-12 | 60-74 dB | Extended dynamic range, not finer gradation |
When you quantize a neural network from FP32 to INT8, you are exactly applying the theory of this lecture. The weights are the "signal." The quantization error on those weights is the "noise." Bennett's Theorem tells you the noise is approximately uniform with variance Δ²/12.
The empirical success of INT8 inference means neural networks are remarkably robust to ~50 dB of SQNR on their weights. Some networks even tolerate INT4 (~25 dB) — roughly the quality of a 1950s telephone applied to the weight tensor! The key insight: neural networks have massive redundancy, so individual weight errors tend to average out across millions of parameters.
More aggressive quantization (INT4, INT2) requires quantization-aware training — retraining the network to compensate for the noise, just like AGC adjusts gain to avoid clipping.