Quantization Noise — EE269 Lecture 3

Chapter 0: The Noise Model

You record a perfect sine wave — a pure tone at 440 Hz. You store it digitally. When you play it back, it doesn't sound quite the same. There's a faint hiss, a subtle graininess. Where did that noise come from? You didn't add it. No microphone was involved. The noise was created by the act of digitization itself.

This is quantization noise: the unavoidable error introduced when you round a continuous value to the nearest discrete level. Every A/D converter, every digital image, every neural network weight stored in INT8 — they all suffer from this fundamental distortion.

The beautiful result of this lecture: under certain conditions, we can model quantization as if someone simply added white noise to the signal. This additive noise model turns an ugly nonlinear operation into something we can analyze with standard linear tools.

The fundamental equation: If q(x) is the quantized version of x, then the quantization error is ε ≜ x − q(x). The entire theory asks: can we treat ε as a random variable with known statistics?

Quantization in Action

A smooth cosine signal (teal) gets mapped to discrete levels (orange staircase). The difference is the quantization error ε shown below.

[Interactive control: bits b, shown at 3]

Look at the error waveform at the bottom. With few bits, the error is large and clearly correlated with the signal — it looks structured, not random. As you increase the bits, the error becomes smaller and starts to look like noise. This visual transition is the intuition behind Bennett's Theorem, which we'll prove in Chapter 6.

Where does quantization noise come from?

From the microphone or sensor
From rounding continuous values to discrete levels during digitization
From electromagnetic interference in wires

Chapter 1: Uniform Quantizer Properties

A uniform quantizer divides the real line into equal-width bins and maps every value within a bin to the bin's midpoint. Think of it as a staircase function: the input slides smoothly, but the output jumps in discrete steps.

The quantizer is completely described by two parameters:

Δ (step size) — the width of each bin
M = 2^b (number of levels) — determined by the number of bits b

The mapping q(x) works like this: find which bin x falls into, then snap to the midpoint of that bin. For a mid-tread quantizer, whose output levels are the integer multiples kΔ (including zero):

q(x) = Δ · ⌊ x/Δ + 1/2 ⌋

The no-overload range (also called the granular region) is the interval where the quantizer operates normally:

|x| ≤ MΔ/2

Within this range, the quantization error is bounded:

|ε| = |x − q(x)| ≤ Δ/2

This is a hard guarantee: if the signal stays within the no-overload range, the maximum possible error is half a step size. Outside this range, the quantizer saturates (clips), and the error can be arbitrarily large.

Key insight: A b-bit quantizer has M = 2^b levels. If the signal has amplitude A, you set Δ = 2A/M to cover the range [−A, A]. More bits means smaller Δ means smaller errors. The tradeoff: more bits = more storage/bandwidth.
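Here is a minimal Python sketch of the rounding rule q(x) = Δ · ⌊x/Δ + 1/2⌋ together with the step-size choice Δ = 2A/M. The function names `step_size` and `quantize` are illustrative, not from the lecture, and no clipping is applied, so the sketch is only meaningful inside the no-overload range.

```python
import math

def step_size(amplitude: float, bits: int) -> float:
    """Step size delta = 2A / M, with M = 2**bits levels covering [-A, A]."""
    return 2.0 * amplitude / (2 ** bits)

def quantize(x: float, delta: float) -> float:
    """Rounding quantizer q(x) = delta * floor(x/delta + 1/2); no saturation."""
    return delta * math.floor(x / delta + 0.5)

A, b = 1.0, 3
delta = step_size(A, b)

# Sweep the interval [-A, A] and record the worst-case error magnitude.
xs = [-A + 2.0 * A * i / 1000 for i in range(1001)]
worst = max(abs(x - quantize(x, delta)) for x in xs)

print(f"delta = {delta}, worst |error| on the grid = {worst:.4f}, bound = {delta / 2}")
```

The worst error found on the grid should never exceed Δ/2, which is the hard guarantee stated above.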

Quantizer Transfer Function

The staircase maps any input x to the nearest quantization level q(x). Orange region = no-overload range. Beyond it, the quantizer clips.

[Interactive controls: levels M (shown at 8), input range (shown at 1.0)]

No-Overload Region

Input stays within [−MΔ/2, MΔ/2]. Error bounded by Δ/2. Quantizer behaves predictably.

Overload Region

Input exceeds the range. Output clips at the extreme level. Error grows with signal amplitude. This is saturation distortion.

A 4-bit quantizer has how many discrete levels?

4
8
16
256

Chapter 2: Bennett's Theorem

Here is the dream: treat the quantization error ε as if it were a random variable, independent of the signal, uniformly distributed on [−Δ/2, Δ/2]. If we could do this, the entire nonlinear quantization operation reduces to a trivial additive noise channel:

q(x) = x − ε, ε ~ Uniform[−Δ/2, Δ/2] (the density is symmetric about zero, so subtracting ε is statistically the same as adding it)

This would give us a complete statistical characterization for free. But can we actually do this? Under what conditions?

Pilanci states two assumptions explicitly:

(A1) ε_n is uniformly distributed on [−Δ/2, Δ/2]
(A2) ε_n is uncorrelated with x_n and with ε_m for m ≠ n

Pilanci's warning: "These assumptions are usually false!" For any fixed signal (like a DC level or a slow sine), the error is completely determined by the signal — it's a deterministic function, not a random variable. The assumptions only become approximately true under specific conditions.
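To see the warning concretely, the sketch below feeds a constant (DC) input through the rounding quantizer q(x) = Δ · ⌊x/Δ + 1/2⌋. The helper names and the values 0.3 and 0.25 are arbitrary choices for illustration.

```python
import math

def quantize(x: float, delta: float) -> float:
    """Rounding quantizer q(x) = delta * floor(x/delta + 1/2)."""
    return delta * math.floor(x / delta + 0.5)

def dc_errors(level: float, delta: float, n: int) -> list:
    """Quantization errors eps_n = x_n - q(x_n) for the constant input x_n = level."""
    return [level - quantize(level, delta) for _ in range(n)]

errors = dc_errors(level=0.3, delta=0.25, n=8)

# Every sample gives the same error: a single point mass, not a spread
# over [-delta/2, delta/2], so assumption (A1) fails for this input.
print(sorted(set(errors)))
```

The set of distinct error values has exactly one element, so the "noise" here is a fixed offset rather than anything resembling a uniform random variable.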

Bennett's Theorem (1948) tells us when the dream holds. As M → ∞ and Δ → 0, assumptions (A1) and (A2) hold if:

The input signal remains in the no-overload region
M is large (many quantization levels)
Δ is small (fine step size)
The input PDF is smooth (no point masses or discontinuities)

The intuition: when the step size is much smaller than the scale of variation of the signal's PDF, each quantization bin sees an approximately uniform slice of the distribution. The error within each bin becomes approximately uniform, and across bins it becomes approximately uncorrelated.

When Does the Noise Model Work?

Compare the histogram of actual quantization errors (bars) versus the ideal uniform distribution (dashed line). With few bits, the mismatch is severe. With many bits and a smooth input, they converge.

[Interactive controls: bits b (shown at 3), signal type (shown as Gaussian)]

Consequences of Bennett's Theorem: When the assumptions hold, we get E[ε_n] = 0 (zero mean) and Var[ε_n] = Δ²/12 (variance depends only on step size). Together with the whiteness in (A2), these two numbers are all the additive noise model needs.

The variance formula Δ²/12 comes directly from the uniform distribution: if X ~ Uniform[−a, a], then Var[X] = (2a)²/12 = a²/3. With a = Δ/2, we get Var = (Δ/2)²/3 = Δ²/12.
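A quick Monte Carlo check can make these predictions tangible. The sketch below assumes a zero-mean Gaussian input with σ = 1, a fine step Δ = σ/16, and no clipping. All of these are illustrative choices rather than settings from the lecture.

```python
import math
import random

def quantize(x, delta):
    """Rounding quantizer q(x) = delta * floor(x/delta + 1/2); no saturation."""
    return delta * math.floor(x / delta + 0.5)

def error_statistics(delta, sigma=1.0, n=100_000, seed=0):
    """Return (mean, variance, correlation with x) of eps = x - q(x) for Gaussian x."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma) for _ in range(n)]
    es = [x - quantize(x, delta) for x in xs]
    mean_e = sum(es) / n
    mean_x = sum(xs) / n
    var_e = sum((e - mean_e) ** 2 for e in es) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    cov = sum((x - mean_x) * (e - mean_e) for x, e in zip(xs, es)) / n
    return mean_e, var_e, cov / math.sqrt(var_x * var_e)

delta = 1.0 / 16
mean_e, var_e, corr = error_statistics(delta)
print(f"mean        = {mean_e:+.5f}  (model: 0)")
print(f"variance    = {var_e:.3e}  (model: {delta ** 2 / 12:.3e})")
print(f"corr(x,eps) = {corr:+.4f}  (model: 0)")
```

With a step this fine relative to σ, the sample mean, variance, and correlation should land close to the model values. Rerunning with a much coarser step (say Δ = 3σ) is a simple way to watch the agreement degrade.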

According to Pilanci, why are assumptions (A1) and (A2) "usually false"?

For a fixed signal, the quantization error is a deterministic function of the signal, not a random variable
Because quantizers are poorly manufactured
Because digital systems introduce rounding errors in the CPU

Chapter 3: The 6 dB Rule

Now we have all the pieces. If ε ~ Uniform[−Δ/2, Δ/2], then the noise power is:

E[ε²] = Var[ε] = Δ² / 12
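If you prefer to see this from first principles, the second moment can be computed directly from the uniform density 1/Δ on [−Δ/2, Δ/2]. Since the mean is zero by symmetry, the second moment and the variance coincide:

```latex
\mathbb{E}[\varepsilon^2]
  = \int_{-\Delta/2}^{\Delta/2} e^2 \,\frac{1}{\Delta}\, de
  = \frac{1}{\Delta}\left[\frac{e^3}{3}\right]_{-\Delta/2}^{\Delta/2}
  = \frac{1}{\Delta}\cdot\frac{2}{3}\cdot\frac{\Delta^3}{8}
  = \frac{\Delta^2}{12}
```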

The Signal-to-Quantization-Noise Ratio (SQNR) measures how much signal power we have relative to the quantization noise power:

SQNR = E[x²] / E[ε²] = 12σ²_x / Δ² (taking x to be zero-mean, so that E[x²] = σ²_x)

In decibels:

SQNR_dB = 10 log₁₀(12σ²_x / Δ²)

Now here's the key manipulation. If we set the quantizer range to cover ±Kσ of the signal (a common design choice), then Δ = 2Kσ/2^b. Substituting:

SQNR_dB = 6.02b + 10 log₁₀(12/(4K²)) = 6.02b + 10 log₁₀(3/K²)
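The substitution can be written out step by step. Here σ_x is the signal standard deviation and K the loading factor from the paragraph above:

```latex
\begin{aligned}
\mathrm{SQNR} &= \frac{12\,\sigma_x^2}{\Delta^2}
  = \frac{12\,\sigma_x^2}{\left(2K\sigma_x / 2^b\right)^2}
  = \frac{12 \cdot 2^{2b}}{4K^2}
  = \frac{3 \cdot 2^{2b}}{K^2}, \\
\mathrm{SQNR}_{\mathrm{dB}} &= 10\log_{10}\!\left(2^{2b}\right) + 10\log_{10}\!\left(\frac{3}{K^2}\right)
  = 20\,b\,\log_{10} 2 + 10\log_{10}\!\left(\frac{3}{K^2}\right)
  \approx 6.02\,b + 10\log_{10}\!\left(\frac{3}{K^2}\right).
\end{aligned}
```

As a worked example, K = 4 gives a constant of 10 log₁₀(3/16) ≈ −7.27 dB, so the offset is negative for any loading factor K > √3.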

The 6 dB rule: each additional bit of resolution improves the SQNR by approximately 6.02 dB. Halving the step size (which is what adding one bit does) reduces the noise power by a factor of 4, which is 10 log₁₀(4) ≈ 6.02 dB.

The 6 dB rule in one line: every additional bit buys about 6 dB of SQNR. It is one of the most widely used rules of thumb in digital audio, imaging, and now ML quantization, because it tells you what each bit you throw away costs.

Bits (b)	SQNR (dB)	Quality	Standard
4	~25	Severe distortion	—
8	~50	Telephone quality	PCM telephony
12	~74	FM radio quality	Early digital audio
16	~98	CD quality	CD / WAV
24	~146	Studio / transparent	Professional audio
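The lecture does not say which test signal the table assumes. The values are close to what the formulas above give for a full-scale sine (signal power A²/2, step Δ = 2A/2^b), so the sketch below uses that as an assumption. The function name `sine_sqnr_db` is illustrative.

```python
import math

def sine_sqnr_db(bits: int) -> float:
    """SQNR in dB for a full-scale sine: power A^2/2, step delta = 2A / 2**bits."""
    amplitude = 1.0
    delta = 2.0 * amplitude / (2 ** bits)
    signal_power = amplitude ** 2 / 2.0
    noise_power = delta ** 2 / 12.0          # additive noise model
    return 10.0 * math.log10(signal_power / noise_power)

for bits in (4, 8, 12, 16, 24):
    print(f"{bits:2d} bits -> {sine_sqnr_db(bits):6.1f} dB")
```

Under this assumption the closed form is 6.02b + 10 log₁₀(1.5) ≈ 6.02b + 1.76 dB, which is where the roughly 2 dB offset above a bare "6 dB per bit" count comes from.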

SQNR Explorer — The 6 dB Rule in Action

Adjust the bit depth. Watch the SQNR climb by ~6 dB per bit. The top shows original vs quantized signal; the bottom shows the error magnified. The meter on the right shows quality level.

[Interactive controls: bits b (shown at 8), signal frequency (shown at 3.0), signal amplitude (shown at 0.9). Readout: SQNR = 49.9 dB, quality: Telephone]

Observe that when you go from 8 bits to 16 bits (adding 8 bits), the SQNR jumps by about 48 dB, matching 8 × 6.02 ≈ 48.2 dB. The rule is remarkably precise.

Why 6.02 exactly? Adding one bit doubles M, halving Δ. Noise power Δ²/12 drops by factor 4. In dB: 10 log₁₀(4) = 10 · 2 log₁₀(2) = 20 · 0.301 = 6.02 dB. The factor of 2 in the log converts a power ratio to the familiar "6 dB per bit."
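Turning the rule around gives a quick design calculation: how many bits to add for a target SQNR gain. This is a small sketch under the additive noise model. The name `extra_bits_needed` is illustrative, and rounding up reflects that bits come in whole units.

```python
import math

DB_PER_BIT = 20.0 * math.log10(2.0)   # about 6.02 dB per added bit

def extra_bits_needed(sqnr_gain_db: float) -> int:
    """Smallest whole number of added bits giving at least the requested SQNR gain."""
    return math.ceil(sqnr_gain_db / DB_PER_BIT)

# 30 dB / 6.02 dB per bit is about 4.98, so five extra bits are needed.
print(extra_bits_needed(30.0))
```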

You have a 12-bit quantizer and need to improve SQNR by 18 dB. How many extra bits do you need?

2 bits
3 bits
6 bits
18 bits

Chapter 4: Practical Applications

The 6 dB rule isn't just theory — it literally shaped the engineering of every digital media standard you use daily. Engineers picked bit depths by asking: "How many dB of SQNR does the application need?"

Audio Standards

The human ear has a dynamic range of about 120-140 dB (from the threshold of hearing to the threshold of pain). CD audio at 16 bits gives ~98 dB of SQNR — not enough for the full range, but enough that the quantization noise sits below the noise floor of typical playback environments.

Professional studios use 24 bits (~146 dB) not because humans can hear that range, but because it provides headroom for mixing. When you add, filter, and process signals, you accumulate quantization errors. Starting with more bits means the accumulated errors stay inaudible.

Fixed-Rate (Lossless)

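Every sample gets the same number of bits: WAV and AIFF store the PCM samples directly. FLAC adds lossless compression (prediction followed by entropy coding), so its bit rate varies, yet decoding returns every original bit. None of these formats quantizes beyond the original A/D conversion.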

Variable-Rate (Lossy)

MP3, AAC (the codec usually found in MP4 files). These adapt the bits per frequency band based on psychoacoustic masking. Quiet parts near loud parts are quantized more coarsely, because the loud sound masks the added noise.

Image Quantization

For images, 8 bits per channel gives ~48 dB SQNR. This turns out to be almost exactly the point where human vision saturates — we cannot distinguish more than about 200-250 gray levels in a single image under normal viewing conditions. This is why JPEG, PNG, and virtually every display standard uses 8 bits per channel.

HDR imaging (10-12 bits per channel) isn't about seeing finer gradations — it's about covering a wider dynamic range (bright highlights + dark shadows in the same image) without saturation clipping.

Design principle: Match the bit depth to the perceptual resolution of the target sense. Audio: 16-24 bits (ear has 120+ dB range). Images: 8 bits per channel (eye saturates at ~48 dB per channel). Going beyond the perceptual limit wastes bandwidth with no quality gain.

Neural Network Quantization

Modern ML uses exactly this theory when quantizing neural network weights from FP32 (23-bit mantissa) down to INT8 (8 bits) or even INT4 (4 bits). The 6 dB rule tells you: going from FP32 to INT8 costs you roughly (23 − 8) × 6 = 90 dB of SQNR on the weights. The empirical question is whether the network can tolerate that level of weight noise without losing accuracy.

Bit Depth Comparison

See how a signal looks at different bit depths simultaneously. The blue region shows the quantization error envelope.

Why do images typically use only 8 bits per channel while audio uses 16-24 bits?

Images are less important than audio
Human vision saturates at ~48 dB per channel (8 bits), while hearing spans 120+ dB
Storage is cheaper for images

Chapter 5: Saturation & Overload

Everything we've derived so far — the Δ²/12 noise, the 6 dB rule, Bennett's Theorem — assumes the signal stays in the no-overload range [−MΔ/2, MΔ/2]. What happens when it doesn't?

When the signal exceeds the quantizer's range, it clips. The output is stuck at the maximum (or minimum) level regardless of how large the input gets. The error is no longer bounded by Δ/2 — it can be arbitrarily large:

|x| > MΔ/2 ⇒ |ε| = |x − q(x)| can be ≫ Δ/2

Clipping distortion sounds terrible — it generates harsh harmonics. In images, it produces blown-out white or crushed black regions where detail is irrecoverably lost. This is far worse than the gentle granular noise within the no-overload region.

The fundamental tradeoff: For a given bit budget b, you must choose Δ. Make Δ small (fine quantization) and the no-overload range shrinks — you clip more often. Make Δ large and the range grows but quantization noise increases. You can't win both ways with a fixed number of bits.

Formally, the no-overload range has total width MΔ = 2^bΔ. If your signal has standard deviation σ, you want the range to cover ±Kσ for some K (typically K = 3 or 4 for Gaussian signals). This means:

Δ = 2Kσ / 2^b

Larger K means less clipping but coarser quantization. Smaller K means finer quantization but more clipping. In practice, K = 4 captures 99.99% of a Gaussian signal's samples.
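The tradeoff in K can be tabulated directly. The sketch below assumes a zero-mean Gaussian input, so the chance that a sample falls outside ±Kσ is erfc(K/√2), and it uses Δ = 2Kσ/2^b for the granular noise power Δ²/12. The function names are illustrative.

```python
import math

def overload_probability(k: float) -> float:
    """P(|x| > K*sigma) for a zero-mean Gaussian: the chance that a sample clips."""
    return math.erfc(k / math.sqrt(2.0))

def granular_noise_power(k: float, bits: int, sigma: float = 1.0) -> float:
    """delta^2 / 12 with delta = 2*K*sigma / 2**bits."""
    delta = 2.0 * k * sigma / (2 ** bits)
    return delta ** 2 / 12.0

bits = 8
for k in (2.0, 3.0, 4.0, 5.0):
    print(f"K = {k:.0f}: P(clip) = {overload_probability(k):.2e}, "
          f"granular noise power = {granular_noise_power(k, bits):.2e}")
```

Reading down the output, the clipping probability falls steeply with K while the granular noise power grows as K², which is the fixed-bit-budget tradeoff described above.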

Automatic Gain Control (AGC)

AGC dynamically adjusts the signal amplitude before quantization to keep it within the no-overload range. Your phone does this constantly during calls: when you speak loudly, it turns down the gain; when you whisper, it turns up the gain. This maximizes the use of available quantization levels without clipping.
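As a rough illustration of the idea only, the sketch below rescales each block of samples so its peak sits just inside the quantizer range. This is a crude block-peak normalizer, not how a production AGC is built (those typically track a smoothed envelope with attack and release times). The names `block_gain` and `apply_agc` and the 0.9 headroom factor are assumptions made for the example.

```python
def block_gain(block, full_scale, headroom=0.9):
    """Gain that places the block's peak at headroom * full_scale."""
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return 1.0                      # silent block: leave it unchanged
    return headroom * full_scale / peak

def apply_agc(signal, block_size, full_scale):
    """Scale each block independently so it fits inside [-full_scale, full_scale]."""
    out = []
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        gain = block_gain(block, full_scale)
        out.extend(gain * v for v in block)
    return out

# A whisper-level block followed by a block that would clip at full_scale = 1.0.
quiet_then_loud = [0.01, -0.02, 0.015, 0.02] + [3.0, -2.5, 2.8, -3.0]
print(apply_agc(quiet_then_loud, block_size=4, full_scale=1.0))
```

Both blocks end up with a peak of about 0.9 of full scale: the quiet one is amplified to use more quantization levels, and the loud one is attenuated so it no longer overloads.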

Clipping Demonstration

Increase the signal amplitude until it exceeds the quantizer's range. Watch the error spike at the clipping points (red regions).

[Interactive controls: amplitude (shown at 0.8), bits (shown at 4). Readout: samples clipped: 0%]

What is the tradeoff when choosing Δ for a fixed number of bits?

Speed vs accuracy
Cost vs quality
Smaller Δ means less granular noise but more clipping; larger Δ means less clipping but more noise

Chapter 6: Bennett's Theorem — Proof Sketch

We now prove why the quantization error becomes uniform as Δ → 0. The proof uses a clever decomposition of the CDF of the error conditioned on which quantization bin the signal falls into.

Setup

Let the quantization bins be indexed by k. The k-th bin spans the interval [x_k − Δ/2, x_k + Δ/2] where x_k is the bin center. The quantization error within this bin is:

ε = x − x_k, ε ∈ [−Δ/2, Δ/2]

CDF Decomposition

We want the CDF of ε: P(ε ≤ e) for e ∈ [−Δ/2, Δ/2]. We condition on which bin the signal falls into:

P(ε ≤ e) = ∑_k P(ε ≤ e | x ∈ bin k) · P(x ∈ bin k)

Given that x is in bin k, the error ε = x − x_k ≤ e is equivalent to x ≤ x_k + e. So:

P(ε ≤ e | x ∈ bin k) = P(x ≤ x_k + e | x_k − Δ/2 ≤ x ≤ x_k + Δ/2)

If the PDF f_x is approximately constant over the bin (this is the "smooth PDF" condition — valid when Δ is small relative to the scale of variation of f_x), then:

P(ε ≤ e | x ∈ bin k) ≈ (e + Δ/2) / Δ

This is the CDF of a Uniform[−Δ/2, Δ/2] distribution! And crucially, it doesn't depend on k. So after summing over all bins (which sums the weights P(x ∈ bin k) to 1), we get:

P(ε ≤ e) ≈ (e + Δ/2) / Δ ∀ e ∈ [−Δ/2, Δ/2]
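The constant-PDF step in the conditional CDF can be written out explicitly. Writing the conditional probability as a ratio of integrals of f_x and replacing f_x(u) by its value at the bin center x_k gives:

```latex
\begin{aligned}
P(\varepsilon \le e \mid x \in \text{bin } k)
  &= \frac{\int_{x_k - \Delta/2}^{\,x_k + e} f_x(u)\,du}
          {\int_{x_k - \Delta/2}^{\,x_k + \Delta/2} f_x(u)\,du} \\
  &\approx \frac{f_x(x_k)\,\left(e + \Delta/2\right)}{f_x(x_k)\,\Delta}
   = \frac{e + \Delta/2}{\Delta}.
\end{aligned}
```

The factor f_x(x_k) cancels, which is exactly why the result does not depend on k.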

The key step: The approximation "f_x constant within each bin" is what makes this work. It requires Δ to be small relative to the scale on which f_x varies. For a smooth PDF, this is satisfied when M is large and Δ is small. For a PDF with point masses (like a DC signal), it never holds — explaining why the assumptions are "usually false" for pathological signals.

Uncorrelation (A2)

The proof of (A2) follows similarly. The error at time n depends on which bin x_n occupies. If x_n and x_m land in different bins (which they generically do for smooth random processes as Δ → 0), then the errors ε_n and ε_m are determined by independent "within-bin positions" and hence are uncorrelated.

Proof Visualization: Within-Bin Distribution

Each bin slices the PDF. If the PDF is approximately flat within each bin (small Δ), the conditional position within the bin is approximately uniform. Shrink Δ to see convergence.

[Interactive control: step size Δ, shown at 1.00]

Summary of conditions: Bennett's Theorem requires: (1) no clipping, (2) many levels, (3) small Δ, (4) smooth input PDF. In practice, conditions (2)-(4) are satisfied when b ≥ 6 for typical signals. For coarse quantization (b ≤ 4), the additive noise model breaks down and you must analyze the quantizer directly.
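To put numbers on the convergence, the CDF decomposition from the proof can be evaluated exactly for a Gaussian input, with no sampling involved. The sketch below assumes x ~ N(0, 1), bins centered at kΔ, and no clipping. It sums P(x_k − Δ/2 ≤ x ≤ x_k + e) over bins and compares the result with the uniform CDF (e + Δ/2)/Δ. The function names are illustrative.

```python
import math

def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_cdf(e: float, delta: float, sigma: float = 1.0) -> float:
    """P(eps <= e) for x ~ N(0, sigma^2), summed over bins centred at k*delta."""
    kmax = math.ceil(12.0 * sigma / delta)   # bins beyond +/-12 sigma carry negligible mass
    total = 0.0
    for k in range(-kmax, kmax + 1):
        center = k * delta
        total += (normal_cdf((center + e) / sigma)
                  - normal_cdf((center - delta / 2) / sigma))
    return total

def max_deviation_from_uniform(delta: float, points: int = 40) -> float:
    """Largest gap between the true error CDF and (e + delta/2)/delta on a grid of e."""
    worst = 0.0
    for i in range(points + 1):
        e = -delta / 2 + delta * i / points
        uniform = (e + delta / 2) / delta
        worst = max(worst, abs(error_cdf(e, delta) - uniform))
    return worst

for delta in (3.0, 2.0, 1.0, 0.5):
    dev = max_deviation_from_uniform(delta)
    print(f"delta = {delta:>3} sigma: max CDF deviation = {dev:.2e}")
```

For this smooth input the deviation shrinks very quickly as Δ drops below σ, which matches the intuition that each bin then sees a nearly flat slice of the PDF.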

What is the key approximation in Bennett's Theorem proof?

The input PDF is approximately constant within each quantization bin when Δ is small
The quantizer is perfectly linear
The signal is periodic

Chapter 7: Mastery

You now have the complete theory of quantization noise. Let's consolidate.

The SQNR Formula Card:
• Noise power: E[ε²] = Δ²/12
• SQNR = 12σ²_x/Δ²
• SQNR_dB ≈ 6.02b + C (where C depends on loading factor)
• Each additional bit: +6.02 dB
• Doubling the number of levels: +6 dB

Bit-Depth Selection Guide

Application	Bits	SQNR	Rationale
ML weights (aggressive)	4	~25 dB	Networks surprisingly tolerant; retrain to compensate
ML inference (standard)	8	~50 dB	INT8 widely supported on GPUs; minimal accuracy loss
Telephony	8	~50 dB	Speech intelligibility preserved
FM broadcast	12	~74 dB	Matches analog FM noise floor
Consumer audio (CD)	16	~98 dB	Exceeds typical listening environment noise
Professional audio	24	~146 dB	Headroom for mixing/processing chains
Images (per channel)	8	~48 dB	Human vision saturates at ~250 levels
HDR imaging	10-12	60-74 dB	Extended dynamic range, not finer gradation

Connection to ML: INT8 Quantization

When you quantize a neural network from FP32 to INT8, you are exactly applying the theory of this lecture. The weights are the "signal." The quantization error on those weights is the "noise." Bennett's Theorem tells you the noise is approximately uniform with variance Δ²/12.
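The sketch below applies this to a stand-in weight tensor. It assumes a symmetric per-tensor scheme with Δ = max|w|/127 and integer codes in [−127, 127], which is one common convention and not something the lecture specifies. The Gaussian "weights" and the function names are hypothetical.

```python
import math
import random

def quantize_weights_int8(weights):
    """Symmetric per-tensor scheme (an assumption): delta = max|w| / 127, codes in [-127, 127]."""
    delta = max(abs(w) for w in weights) / 127.0
    codes = [max(-127, min(127, math.floor(w / delta + 0.5))) for w in weights]
    return codes, delta

def sqnr_db(original, reconstructed):
    """Measured signal-to-quantization-noise ratio in dB."""
    signal = sum(w * w for w in original)
    noise = sum((w - r) ** 2 for w, r in zip(original, reconstructed))
    return 10.0 * math.log10(signal / noise)

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.05) for _ in range(10_000)]   # stand-in for a weight tensor
codes, delta = quantize_weights_int8(weights)
dequantized = [c * delta for c in codes]

measured = sqnr_db(weights, dequantized)
sigma2 = sum(w * w for w in weights) / len(weights)
predicted = 10.0 * math.log10(12.0 * sigma2 / delta ** 2)   # SQNR = 12 sigma^2 / delta^2
print(f"measured {measured:.1f} dB, delta^2/12 model predicts {predicted:.1f} dB")
```

The measured and predicted values should agree closely. Both should come out below the full-scale-sine figure for 8 bits, because Gaussian-like weights have a peak several σ above their RMS value, so the loading-factor constant C is more negative.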

The empirical success of INT8 inference means neural networks are remarkably robust to ~50 dB of SQNR on their weights. Some networks even tolerate INT4 (~25 dB) — roughly the quality of a 1950s telephone applied to the weight tensor! The key insight: neural networks have massive redundancy, so individual weight errors tend to average out across millions of parameters.

More aggressive quantization (INT4, INT2) requires quantization-aware training — retraining the network to compensate for the noise, just like AGC adjusts gain to avoid clipping.

Connecting the lectures: Lecture 2 covered sampling (discretizing in time). This lecture covered quantization (discretizing in amplitude). Together, sampling + quantization = complete digitization. Lecture 4 will cover the information-theoretic limits: given a certain distortion budget, what's the minimum number of bits per sample (rate-distortion theory)?

Key Equations Summary

Quantizer: q(x) = Δ · ⌊x/Δ + 1/2⌋, M = 2^b levels
↓
Error: ε = x − q(x), |ε| ≤ Δ/2 in no-overload
↓
Bennett: ε ~ Uniform[−Δ/2, Δ/2] when Δ small, PDF smooth
↓
Statistics: E[ε] = 0, Var[ε] = Δ²/12
↓
6 dB Rule: SQNR_dB ≈ 6.02b + const, +1 bit = +6 dB

A neural network quantized to INT8 has weight SQNR of ~50 dB. If you further quantize to INT4, approximately how much SQNR do you lose?

About 12 dB
About 24 dB (4 bits × 6 dB/bit)
About 50 dB
About 6 dB