EE269 Lecture 5 — Mert Pilanci, Stanford

Dithering & Stochastic Rounding

How adding noise before quantizing can paradoxically improve quality — and why stochastic rounding is essential for low-precision ML training.

Prerequisites: EE269 Lectures 3-4 (Quantization, Lloyd-Max) + Basic Probability. That's it.
8 Chapters · 7+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Randomness Helps

You're quantizing a smooth gradient — say, a slowly changing brightness across an image, or a slowly drifting signal. Your quantizer has Δ = 1. The input ramps smoothly from 0.0 to 5.0 over 1000 samples. What does the output look like?

Flat steps. With a rounding quantizer, the output stays at 0 for the first 100 samples (inputs below 0.5), then jumps to 1 and stays there for 200 samples, jumps to 2 for another 200, and so on. Instead of a smooth ramp, you get a staircase with visible contour lines: sharp boundaries between levels. In images, these appear as banding. In audio, they create audible idle tones and distortion.

The problem isn't that the error is large — it's bounded by Δ/2 = 0.5 at worst. The problem is that the error is structured. It's perfectly correlated with the input. For a slowly varying input, the error is a sawtooth wave at a frequency determined by the input slope. That sawtooth is periodic, and periodic artifacts are perceptually far worse than random noise of the same energy.
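The staircase and its sawtooth error are easy to reproduce. The following is a minimal sketch, assuming a round-half-up quantizer and a ramp sampled at 1000 points from 0.0 up to just under 5.0; the helper name quantize is illustrative, not from the lecture.

```python
import math

def quantize(z, delta=1.0):
    """Uniform rounding quantizer (round half up): delta * floor(z/delta + 0.5)."""
    return delta * math.floor(z / delta + 0.5)

n = 1000
ramp = [5.0 * i / n for i in range(n)]       # 0.0, 0.005, ..., 4.995
out = [quantize(x) for x in ramp]
err = [q - x for q, x in zip(out, ramp)]     # sawtooth with period delta

# Count how many consecutive samples sit on each output level.
counts = {}
for q in out:
    counts[q] = counts.get(q, 0) + 1

for level in sorted(counts):
    print(level, counts[level])
print("max |error| =", max(abs(e) for e in err))
```

The end levels hold for half as long as the interior ones because only half of their bin lies inside the ramp's range. The error never exceeds Δ/2, yet it repeats identically from bin to bin, which is the structure the text describes.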

The paradox: Adding random noise to the input BEFORE quantizing makes the output look (and sound) better. You're deliberately corrupting the signal, yet the result has fewer artifacts. How? The noise breaks up the structured error pattern, replacing visible contours with invisible grain.

This technique is called dithering. The word comes from Middle English "didderen" (to tremble). The idea dates to World War II, when engineers discovered that adding mechanical vibration to analog-mechanical bombing computers made them more accurate: the vibrations prevented the gears from settling into discrete positions. By the 1960s, Roberts (MIT) and Schuchman (Bell Labs) had formalized the theory for digital systems.

In this lecture, Pilanci shows two dither architectures, proves a remarkable linearization property (the quantizer becomes equivalent to adding white noise!), and connects this to stochastic rounding — the technique that makes low-precision ML training possible.

The Landscape of Quantization Artifacts

Without dither, uniform quantizers produce several types of structured artifacts:

Contouring: Flat regions in images where the signal crosses a quantization boundary. Appears as visible "steps" in smooth gradients. Severity increases with fewer bits.

Limit cycles: In feedback systems (sigma-delta converters, IIR filters), a constant input causes the output to oscillate periodically between two quantization levels. This creates audible tones at frequencies unrelated to the input.

Idle channel noise: When the input is zero (or constant), the quantizer should output silence. But in a feedback loop, the quantization error recirculates and can create low-level oscillations — "idle tones" audible in quiet passages.

Granular noise: For inputs within the quantizer's range, the error is bounded but signal-dependent. For slowly varying signals, this creates buzzing or grinding sounds that track the input envelope.

All of these artifacts are signal-dependent — they change character as the input changes. Dithering eliminates all of them by making the error independent of the input.

Structured Error vs. Dithered Error

A smooth ramp quantized with and without dither. Toggle to compare. Notice: without dither, the error is a periodic sawtooth. With dither, the error looks like white noise.

Why are structured quantization artifacts worse than random noise of the same energy?

Chapter 1: Non-Subtractive Dither

The simplest dither architecture has three steps: add noise, quantize, output. We call it non-subtractive because the dither noise is never removed — it stays in the output.

The Architecture

1. Input x: the original signal sample
2. Add dither: z = x + d, where d is uniform on [−Δ/2, Δ/2]
3. Quantize: y = QΔ(z), a uniform rounding quantizer with step Δ
4. Output y: the quantized, dithered result

The uniform rounding quantizer is defined as:

QΔ(z) = Δ · round(z / Δ)

This rounds to the nearest multiple of Δ. For Δ = 1, it's just ordinary rounding: Q(2.3) = 2, Q(2.7) = 3, Q(2.5) = 3 (round half up).

The dither signal d is drawn independently for each sample from a uniform distribution on [−Δ/2, +Δ/2]. Crucially, d must be independent of the input x.

Why uniform on [−Δ/2, +Δ/2]? This is the same width as one quantization bin. The dither "smears" the input across one full bin width. If x sits 0.3Δ above the level below it, the dithered value z = x + d rounds up with probability 0.3 and down with probability 0.7, so the output is random but correct on average. This is what breaks the deterministic staircase.

What Happens to the Error?

Without dither, the quantization error is ε = QΔ(x) − x. This error is a deterministic function of x: a sawtooth repeating with period Δ.

With non-subtractive dither, the output is y = QΔ(x + d). The error is:

εdither = QΔ(x + d) − x = [QΔ(x + d) − (x + d)] + d

The first bracket is the quantization error of z = x + d, which is uniform on [−Δ/2, Δ/2] (because d randomizes where z lands within a bin). The second term d is also uniform on [−Δ/2, Δ/2]. The total error is the sum of these two terms.

The total noise power increases: instead of Δ²/12 (from quantization alone), it averages Δ²/6 over positions within a bin. But the error is no longer a deterministic function of the input, and its mean is zero wherever x sits. That's the trade: higher noise floor, but no structured artifacts.

Non-Subtractive Dither: Step by Step

Watch a single sample get dithered and quantized. Drag the input value. The dither d is redrawn each time.


Worked Example

Let Δ = 1, x = 1.8. Without dither: Q(1.8) = 2. Error = 0.2.

With dither: d is uniform on [−0.5, 0.5]. Say d = −0.3. Then z = 1.8 + (−0.3) = 1.5. Q(1.5) = 2. Output = 2. Error = 0.2.

Another draw: d = 0.4. Then z = 1.8 + 0.4 = 2.2. Q(2.2) = 2. Output = 2. Error = 0.2.

Another draw: d = 0.3. Then z = 1.8 + 0.3 = 2.1. Q(2.1) = 2. Output = 2. Error = 0.2.

Another draw: d = −0.4. Then z = 1.8 + (−0.4) = 1.4. Q(1.4) = 1. Output = 1. Error = −0.8.

Notice: most of the time the output is 2 (close to the true value), but occasionally it jumps to 1 (a large error). Over many samples, the average output is close to x. We'll formalize this in the next chapter.

When Does Dithering Round Down?

For x = 1.8 and Δ = 1: the distance from x to the next lower level (1.0) is 0.8, and to the upper level (2.0) is 0.2. Dither rounds down whenever z = x + d falls below 1.5 (the midpoint), i.e., when d < −0.3. Since d is uniform on [−0.5, 0.5], P(d < −0.3) = 0.2/1.0 = 0.2.

So P(round down) = 0.2 = distance to upper level / Δ. And P(round up) = 0.8 = distance to lower level / Δ. This is exactly stochastic rounding (Chapter 4)! Non-subtractive dither with uniform noise on [−Δ/2, Δ/2] is probabilistically equivalent to stochastic rounding.
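The rounding probabilities for x = 1.8 can be checked empirically. This is a small Monte Carlo sketch, assuming round-half-up quantization and Python's random.Random as the dither source; the function names are illustrative.

```python
import math
import random

def quantize(z, delta=1.0):
    """Round to the nearest multiple of delta (ties round up)."""
    return delta * math.floor(z / delta + 0.5)

def nonsubtractive_dither(x, delta, rng):
    """Add uniform dither on [-delta/2, delta/2], then quantize."""
    d = rng.uniform(-delta / 2, delta / 2)
    return quantize(x + d, delta)

rng = random.Random(0)                 # fixed seed so the run is repeatable
trials = 100_000
outs = [nonsubtractive_dither(1.8, 1.0, rng) for _ in range(trials)]

p_down = sum(1 for y in outs if y == 1.0) / trials
mean_out = sum(outs) / trials
print("P(round down) ~", round(p_down, 3))   # should be close to 0.2
print("mean output   ~", round(mean_out, 3)) # should be close to 1.8
```

Only the two neighboring levels ever appear in the output, and their relative frequencies are what make the average land on x.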

Noise Power Analysis

The total output error for non-subtractive dither has two components: the quantization error η and the dither noise d that remains in the output. Since the output is y = Q(x + d) = x + d + η, the error relative to x is d + η.

If η and d were independent, the noise power would be Var(η) + Var(d) = Δ²/12 + Δ²/12 = Δ²/6. In practice, η and d aren't independent (η is a function of x + d), and the exact power depends on x's position within the bin: if f is the fractional position of x, the total error power is f(1−f)Δ², which ranges from 0 (x on a level) to Δ²/4 (x at a midpoint) and averages Δ²/6 over positions.

The cost of non-subtractive dither: On average you pay about 3 dB more noise power than undithered quantization (Δ²/6 versus Δ²/12), and up to 4.8 dB in the worst case. But you eliminate the deterministic, signal-locked error pattern. For perceptual applications, this is almost always a good trade.
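How the error power of non-subtractive dither varies with the position of x inside a bin can be computed numerically. The sketch below averages (Q(x + d) − x)² over a fine grid of dither values instead of sampling, so the result is deterministic; the grid size and the function name error_power are assumptions made for illustration. The values agree with the stochastic rounding variance p(1−p)Δ² from Chapter 4, as expected from the equivalence noted above.

```python
import math

def quantize(z, delta=1.0):
    return delta * math.floor(z / delta + 0.5)

def error_power(x, delta=1.0, steps=20_000):
    """Approximate E[(Q(x+d) - x)^2] for d ~ Uniform[-delta/2, delta/2]
    by averaging over a midpoint grid of dither values."""
    total = 0.0
    for k in range(steps):
        d = -delta / 2 + (k + 0.5) * delta / steps
        total += (quantize(x + d, delta) - x) ** 2
    return total / steps

for x in (0.0, 0.1, 0.25, 0.5):
    f = x - math.floor(x)              # fractional position within the bin
    print(x, round(error_power(x), 4), round(f * (1 - f), 4))
```

The two printed columns should match: the power is zero when x already sits on a level and peaks at Δ²/4 when x is halfway between levels.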
In non-subtractive dither with Δ = 1, what distribution does the dither signal d follow?

Chapter 2: The Linearization Property

Here is the most remarkable result about dithered quantization. Under the right conditions, a nonlinear quantizer becomes equivalent to a linear channel plus white noise. The quantizer effectively disappears, replaced by simple additive noise that is independent of the input.

The Theorem (Schuchman, 1964)

Let x be any input signal. Let d be independent uniform noise on [−Δ/2, +Δ/2]. Define the dithered quantizer output:

y = QΔ(x + d)

Then the quantization error η = QΔ(x + d) − (x + d) satisfies:

• η is uniformly distributed on [−Δ/2, +Δ/2]

• η is independent of the input x

LINEARIZATION: The output of the dithered quantizer can be written as:

y = QΔ(x + d) = x + d + η

where η ~ Uniform[−Δ/2, Δ/2] is independent of x. The nonlinear quantizer has been linearized into: output = input + noise.

The Proof

Define the fractional part of x/Δ as:

f = x/Δ − floor(x/Δ) ∈ [0, 1)

The position of x within its quantization bin is f·Δ ∈ [0, Δ). After adding dither d ~ Uniform[−Δ/2, Δ/2], the position of z = x + d within a bin is:

z mod Δ = (f·Δ + d) mod Δ

Since d is uniform on an interval of width Δ, and addition mod Δ by a uniform variable yields a uniform result (regardless of f), the fractional position of z within its bin is uniform on [0, Δ). This means:

Step 1: The quantization error η = QΔ(z) − z depends only on where z sits within its bin. Since that position is uniform (thanks to d), η is uniform on [−Δ/2, Δ/2].

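Step 2: The argument in Step 1 holds for every fixed value of x: whatever x is, the conditional distribution of η given x is the same Uniform[−Δ/2, Δ/2]. A distribution that does not change with x is exactly what independence means, so η is independent of x.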

This is the formal version of the intuition: dither randomizes the position within the bin, making the error behave like white noise regardless of where x sits.

Why This Matters

Without dither: error depends on x (deterministic sawtooth). The quantizer is nonlinear.

With dither: error is independent of x (white noise). The quantizer behaves like a linear system + additive noise. This makes analysis tractable — you can model the quantizer as a simple noise source in any system-level design.

The engineering significance: In feedback systems (sigma-delta modulators, control loops), a signal-dependent quantization error causes limit cycles, spurious oscillations, and instability. Dithering eliminates these by breaking the correlation between error and input. The error becomes benign white noise that averages out.

Numerical Verification

Let Δ = 1. Fix x = 0.3 (always the same input). Generate 10,000 independent dither samples d_i ~ Uniform[−0.5, 0.5]. For each, compute y_i = Q(0.3 + d_i) and the error η_i = y_i − (0.3 + d_i).

Histogram the errors: you get a flat histogram on [−0.5, 0.5] (up to sampling fluctuation), regardless of the choice x = 0.3. Try x = 0.9, x = 0.0, x = −1.7: same histogram every time. That's independence.
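The experiment just described can be run in a few lines. This is a sketch under the same setup (Δ = 1, 10,000 dither samples per input), assuming a 10-bin histogram and a fixed seed for repeatability; eta_samples and histogram are hypothetical helper names.

```python
import math
import random

def quantize(z, delta=1.0):
    return delta * math.floor(z / delta + 0.5)

def eta_samples(x, n=10_000, delta=1.0, seed=0):
    """Quantization error eta = Q(x+d) - (x+d) for n independent dither draws."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = x + rng.uniform(-delta / 2, delta / 2)
        out.append(quantize(z, delta) - z)
    return out

def histogram(samples, bins=10, lo=-0.5, hi=0.5):
    """Count samples in equal-width bins; the top edge falls in the last bin."""
    counts = [0] * bins
    for s in samples:
        k = min(int((s - lo) / (hi - lo) * bins), bins - 1)
        counts[k] += 1
    return counts

for x in (0.3, 0.9, 0.0, -1.7):
    eta = eta_samples(x)
    power = sum(e * e for e in eta) / len(eta)
    print(x, histogram(eta), round(power, 4))
```

Each row should show roughly 1,000 samples per bin and a power near Δ²/12 ≈ 0.0833, whichever x is used.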

Why "Modular Arithmetic" Makes This Work

The deep reason linearization works is a property of modular addition. If U ~ Uniform[0, 1) and c is any constant, then (U + c) mod 1 ~ Uniform[0, 1). Adding a constant and taking the modulus preserves uniformity. This is why:

• The fractional position of x within its bin is some constant f ∈ [0, 1)

• d shifts this position by a uniform random amount

• After the shift (modulo Δ), the position within the landing bin is uniform

• Therefore the error (which depends only on position within the bin) is uniform

This argument works for ANY input x, which is why the error is independent of x. The modular arithmetic "forgets" where x was.

Conditions for Linearization

The linearization theorem requires:

• d is uniform on [−Δ/2, Δ/2] (exactly one bin width)

• d is independent of x

• The quantizer is the uniform midpoint (rounding) quantizer

If d has a different distribution (e.g., Gaussian), linearization doesn't hold exactly — the error becomes approximately but not perfectly independent of x. Gaussian dither is sometimes used anyway for practical reasons (no hard-bounded support), accepting approximate rather than exact linearization.

Linearization: Error Independence

Fix x and generate 2000 dither samples. The error histogram is always uniform — independent of x. Change x with the slider to verify.

The histogram shape doesn't change as you drag x. That's independence.
After proper dithering, the quantization error η is:

Chapter 3: Subtractive Dither

Non-subtractive dither has a cost: the noise d stays in the output, increasing total error power. Can we get the linearization benefit while removing the dither noise? Yes — if we know d at the decoder.

Architecture B: Subtractive

1. Input x: the original signal
2. Add: z = x + d, with d ~ Uniform[−Δ/2, Δ/2] shared with the decoder
3. Quantize: q = QΔ(z), and transmit q (or its index)
4. Subtract: y = q − d, removing the known dither
5. Output: y = x + η, where η = Q(x+d) − (x+d) is pure quantization noise

The key difference: both encoder and decoder share the same dither sequence d. The encoder adds d before quantizing; the decoder subtracts d after receiving the quantized value. The dither cancels out, leaving only the quantization error η.

The Result

y = QΔ(x + d) − d = x + [QΔ(x + d) − (x + d)] = x + η

From the linearization theorem (Chapter 2), η is uniform on [−Δ/2, Δ/2] and independent of x. So the subtractive dither output is:

y = x + η,    η ~ Uniform[−Δ/2, Δ/2],   η ⊥ x

The noise power is:

E[η²] = Δ²/12

This is exactly the same as the undithered quantizer's noise power for uniform input! But now the error is guaranteed independent of x for any input distribution. We get the exact same noise floor as ideal quantization, plus the independence guarantee.

Subtractive dither gives you everything: Same noise power as undithered (Δ²/12), but with guaranteed independence from the input. No structured artifacts. No limit cycles. No signal-dependent distortion. The only cost: both sides must share the dither sequence (a shared random seed suffices).

Why Subtractive Is Superior (When Possible)

Let's compute the output noise for both architectures with Δ = 1:

Non-subtractive: Output = Q(x + d). Error = Q(x + d) − x = (x + d + η) − x = d + η. The noise power is E[(d + η)²]. Since η is not independent of d (both depend on where x + d lands), the exact power depends on x: it equals f(1−f)Δ² for fractional position f, anywhere from 0 to Δ²/4, and Δ²/6 on average.

Subtractive: Output = Q(x + d) − d. Error = Q(x + d) − d − x = η. The noise power is E[η²] = Δ²/12 exactly. No d contribution remains.

The difference is significant: on average, subtractive dither achieves 3 dB lower noise than non-subtractive, and its error is fully independent of x. The only drawback is the coordination requirement.

Implementation in Digital Systems

In a real system, the "shared dither" is implemented as a pseudo-random number generator (PRNG) initialized with the same seed at both encoder and decoder. Popular choices:

Linear congruential generator: x_{n+1} = (a·x_n + c) mod m. Fast, but limited statistical quality.

LFSR (Linear Feedback Shift Register): Standard in hardware implementations (audio DACs, ADCs). Maximum-length sequences of period 2^N − 1.

Xorshift: Modern choice. Excellent statistical properties, very fast in software.

The seed can be transmitted as metadata (e.g., in a packet header). For real-time audio, the encoder and decoder often derive the seed from frame number or timestamp — no explicit seed transmission needed if they share a clock.
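A shared-seed encoder and decoder pair can be sketched as follows. This is an illustration only: it uses Python's built-in random.Random as the shared PRNG rather than any of the generators listed above, and the names encode, decode and the seed value are hypothetical.

```python
import math
import random

def quantize(z, delta=1.0):
    return delta * math.floor(z / delta + 0.5)

def encode(xs, delta, seed):
    """Encoder: add dither from a seeded PRNG, then quantize."""
    rng = random.Random(seed)
    return [quantize(x + rng.uniform(-delta / 2, delta / 2), delta) for x in xs]

def decode(qs, delta, seed):
    """Decoder: regenerate the same dither sequence and subtract it."""
    rng = random.Random(seed)          # same seed gives the same dither values
    return [q - rng.uniform(-delta / 2, delta / 2) for q in qs]

xs = [0.3] * 20_000                    # constant input, worst case for structure
ys = decode(encode(xs, 1.0, seed=1234), 1.0, seed=1234)
errs = [y - x for x, y in zip(xs, ys)]
mse = sum(e * e for e in errs) / len(errs)
print("error power ~", round(mse, 4), "(delta^2/12 =", round(1 / 12, 4), ")")
```

Only the quantized values and the seed cross the channel. The reconstruction error is η alone, so its power should sit near Δ²/12 even though the input never changes.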

Non-Subtractive vs. Subtractive: Comparison

Property | Non-Subtractive | Subtractive
Architecture | Add d, quantize, output | Add d, quantize, subtract d
Error independence | Mean yes; variance depends on x (RPDF) | Yes (fully)
Output noise power | Δ²/6 on average (d remains) | = Δ²/12 (d removed)
Requires shared d | No | Yes (shared seed)
Use case | One-way (audio/image display) | Two-way (comm systems)

Practical Implementation

In practice, "sharing the dither" means both encoder and decoder initialize a pseudorandom number generator with the same seed. Each sample uses the next PRN in the sequence. The overhead is transmitting one seed value (e.g., 32 bits) at the start of a frame, not one random number per sample.

In digital audio, subtractive dither is used in noise-shaping loops (sigma-delta DACs). In digital communications, it appears in spread-spectrum systems. In ML quantization, dithering appears in its non-subtractive form as stochastic rounding, which we cover next.

Subtractive vs. Non-Subtractive

Compare the output error for both architectures applied to a sine wave. Notice the non-subtractive has higher variance but both have uncorrelated error.

What is the noise power of subtractive dither with step size Δ?

Chapter 4: Stochastic Rounding

Stochastic rounding is dithering stripped to its mathematical essence. Instead of adding noise then rounding, we directly randomize which way we round — with probabilities chosen to make the result unbiased.

The Definition

Given input x and quantization step Δ, let x_lo = Δ·floor(x/Δ) be the level below x, and x_hi = x_lo + Δ be the level above. Stochastic rounding outputs:

SR(x) = x_hi   with probability   p = (x − x_lo) / Δ
SR(x) = x_lo   with probability   1 − p

The probability of rounding up equals the fractional position within the bin. If x is 70% of the way from x_lo to x_hi, it rounds up 70% of the time.

The key property is UNBIASEDNESS:

E[SR(x)] = p · x_hi + (1−p) · x_lo
= ((x − x_lo)/Δ) · (x_lo + Δ) + (1 − (x − x_lo)/Δ) · x_lo
= x_lo + (x − x_lo) = x

The expected output equals the input exactly. No bias, ever.
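The definition translates almost line for line into code. Below is a minimal sketch, assuming a uniform grid with step delta and any random source that exposes a random() method returning a value in [0, 1); the function name stochastic_round is illustrative.

```python
import math
import random

def stochastic_round(x, delta=1.0, rng=random):
    """Round x to the grid level below or above it, choosing the upper level
    with probability equal to x's fractional position in the bin."""
    lo = delta * math.floor(x / delta)   # level below x
    p_up = (x - lo) / delta              # fractional position in [0, 1)
    return lo + delta if rng.random() < p_up else lo

rng = random.Random(0)
samples = [stochastic_round(2.3, 1.0, rng) for _ in range(100_000)]
print("mean of SR(2.3) ~", round(sum(samples) / len(samples), 3))
print("SR(2.0) =", stochastic_round(2.0, 1.0, rng))   # on-grid input never moves
```

Each individual output is 2 or 3, never 2.3, but the sample mean should settle close to 2.3, which is the unbiasedness property in action.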

Why This Matters for ML

In deep learning, gradients can be very small. Consider training a large language model in FP8 format with Δ = 0.01. A parameter receives a gradient of 0.003 — smaller than one step size.

Deterministic rounding: round(0.003 / 0.01) · 0.01 = round(0.3) · 0.01 = 0. The gradient vanishes. Over 1000 steps, this parameter accumulates 1000 × 0 = 0 total update. Learning is killed.

Stochastic rounding: P(round up) = 0.3. Over 1000 steps, about 300 of them round up to 0.01. The accumulated update is ≈ 300 × 0.01 = 3.0. The expected accumulated update is 1000 × 0.003 = 3.0. It works!

The gradient accumulation insight: With deterministic rounding, updates smaller than Δ/2 vanish permanently. With stochastic rounding, small gradients accumulate correctly in expectation. This is the main reason stochastic rounding enables low-precision ML training.
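The numbers in this example (Δ = 0.01, gradient 0.003, 1000 steps) can be simulated directly. The sketch assumes the parameter itself is stored on the low-precision grid and re-rounded after every update; accumulate and the two rounding helpers are hypothetical names.

```python
import math
import random

def round_nearest(x, delta):
    return delta * math.floor(x / delta + 0.5)

def stochastic_round(x, delta, rng):
    lo = delta * math.floor(x / delta)
    return lo + delta if rng.random() < (x - lo) / delta else lo

def accumulate(g, delta, steps, rounder):
    """Apply `steps` updates of size g, rounding the stored value each time."""
    w = 0.0
    for _ in range(steps):
        w = rounder(w + g, delta)
    return w

rng = random.Random(0)
w_det = accumulate(0.003, 0.01, 1000, round_nearest)
w_sr = accumulate(0.003, 0.01, 1000, lambda x, d: stochastic_round(x, d, rng))
print("true total    :", 1000 * 0.003)
print("deterministic :", w_det)
print("stochastic    :", round(w_sr, 2))
```

Deterministic rounding leaves the parameter at zero because each update is rounded away before the next one arrives. The stochastic result fluctuates from run to run but should stay near the true total of 3.0.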

Variance of Stochastic Rounding

The error ε = SR(x) − x has:

• E[ε] = 0 (unbiased)

• Var(ε) = p(1−p)Δ² ≤ Δ²/4

Maximum variance occurs at p = 0.5 (when x is exactly between two levels). Minimum variance is 0 (when x is already a quantization level, p = 0 or 1). The average variance over uniformly distributed fractional positions is Δ²/6, twice the Δ²/12 of deterministic rounding under a uniform input.

Detailed Variance Derivation

Let f = (x − x_lo)/Δ be the fractional position, where x_lo is the level below x and x_hi = x_lo + Δ is the level above. Then p = f, and:

E[ε²] = p·(x_hi − x)² + (1−p)·(x_lo − x)²
= f · ((1−f)Δ)² + (1−f) · (fΔ)²
= f(1−f)²Δ² + (1−f)f²Δ²
= f(1−f)Δ²[(1−f) + f] = f(1−f)Δ²

This is maximized at f = 0.5: Var = 0.25Δ². Averaging over f ~ Uniform[0,1]:

E_f[f(1−f)] = ∫₀¹ f(1−f) df = 1/2 − 1/3 = 1/6

So the average variance is Δ²/6, not Δ²/12. The Δ²/12 formula assumes a uniform error distribution within each bin (true for deterministic rounding with uniform input). Stochastic rounding pays twice that on average over fractional positions, and its MSE (which equals the variance since bias = 0) is bounded by Δ²/4 for any single input.

Comparison: Deterministic vs. Stochastic for a Fixed x

Consider x = 2.3, Δ = 1:

Deterministic (round-to-nearest): Q(2.3) = 2. Error = −0.3. Always. MSE = 0.09.

Stochastic: P(round to 3) = 0.3, P(round to 2) = 0.7. E[error] = 0. Var = 0.3·0.7·1 = 0.21.

The stochastic rounding has HIGHER variance for this single sample! The benefit isn't lower per-sample error — it's the unbiasedness that lets errors cancel over many samples. Over 100 gradient updates of 0.3 each, deterministic always gives 0 (if starting at an integer), while stochastic gives approximately 30 (the true accumulated value).

Showcase: Gradient Accumulation Comparison

Gradient Accumulation: Deterministic vs. Stochastic

A parameter starts at 0. Each step, it receives a small gradient g (smaller than Δ). Watch how deterministic rounding kills the update while stochastic rounding tracks the true value.

A parameter at value 2.0 receives gradient 0.3 with Δ = 1.0. What is E[SR(2.0 + 0.3)]?

Chapter 5: NVFP4 & GPU Quantization

NVIDIA's Blackwell architecture (2024) introduced hardware support for 4-bit floating point — FP4 (also called NVFP4). This makes the theory of this lecture immediately practical: how do you train and infer with only 4 bits per weight?

FP4 Format

NVFP4 uses 4 bits total:

• 1 sign bit

• 2 exponent bits (E2)

• 1 mantissa bit (M1)

With bias = 1, the representable values are:

Bits (s e1e0 m0) | Value | Bits (s e1e0 m0) | Value
0 00 0 | 0 | 0 00 1 | 0.5
0 01 0 | 1 | 0 01 1 | 1.5
0 10 0 | 2 | 0 10 1 | 3
0 11 0 | 4 | 0 11 1 | 6

Plus negatives by flipping the sign bit. Total: 16 values (but +0 = -0, so effectively 15 distinct values).

Notice the non-uniform spacing: {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The gaps grow exponentially (0.5, 0.5, 0.5, 0.5, 1, 1, 2). This is floating-point's inherent property: more precision near zero, less near the extremes. This naturally matches the distribution of neural network weights (concentrated near zero).
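The value table can be regenerated from the bit layout. This sketch assumes the usual small-float convention: exponent field 0 is subnormal (no implicit leading 1), other exponents are normal with bias 1, and no codes are reserved for infinity or NaN. The function name decode_e2m1 is illustrative.

```python
def decode_e2m1(bits):
    """Decode a 4-bit code: 1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1."""
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exp = (bits >> 1) & 0b11
    man = bits & 1
    if exp == 0:
        mag = man * 0.5                          # subnormal: 0 or 0.5
    else:
        mag = (1 + man / 2) * 2.0 ** (exp - 1)   # normal: 1.m x 2^(exp - bias)
    return sign * mag

positive = [decode_e2m1(b) for b in range(8)]
gaps = [b - a for a, b in zip(positive, positive[1:])]
print("values:", positive)
print("gaps  :", gaps)
print("distinct values over all 16 codes:", len({decode_e2m1(b) for b in range(16)}))
```

The gap list makes the structure visible: spacing is constant within one exponent and doubles each time the exponent increments.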

Connection to Lloyd-Max: The non-uniform spacing of FP4 is analogous to an optimal quantizer for a distribution concentrated near zero. Floating-point formats achieve non-uniform quantization "for free" via the exponent — no need to store a separate codebook.

Block Scaling

Raw FP4 can only represent magnitudes in {0, 0.5, ..., 6}. Real weights span different ranges. The solution: block scaling. Group weights into small blocks (NVFP4 uses 16 values per block; the related MXFP4 format uses 32). For each block, compute a scale factor s = max(|w|) / 6. Store the scale in higher precision (FP8 for NVFP4). Each weight is quantized as:

w_q = s · Q_FP4(w / s)

The block scale adds a small overhead (e.g., 1 FP8 value per 16 FP4 weights = 8/(16×4) = 12.5% overhead, or 4.5 effective bits per weight). But it adapts the dynamic range to each local block of weights.
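Block scaling with round-to-nearest can be sketched in plain Python. This is an illustration of the formula above, not NVIDIA's implementation: the scale is kept as an ordinary float instead of being stored in FP8, and quantize_block, dequantize_block and effective_bits are hypothetical helpers.

```python
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # representable magnitudes

def nearest_fp4(v):
    """Round v to the closest signed FP4 value (ties go to the smaller magnitude)."""
    mag = min(FP4_LEVELS, key=lambda q: abs(q - abs(v)))
    return mag if v >= 0 else -mag

def quantize_block(block):
    """Return (scale, codes) so that scale * code approximates each weight."""
    s = max(abs(w) for w in block) / 6.0
    if s == 0.0:
        return 0.0, [0.0] * len(block)
    return s, [nearest_fp4(w / s) for w in block]

def dequantize_block(s, codes):
    return [s * c for c in codes]

def effective_bits(block_size, scale_bits=8, value_bits=4):
    """Storage per weight once the per-block scale is amortized."""
    return value_bits + scale_bits / block_size

s, codes = quantize_block([3.0, 0.4, -1.1, 0.05])
print("scale:", s, "codes:", codes, "reconstructed:", dequantize_block(s, codes))
print("effective bits for block sizes 16 and 32:", effective_bits(16), effective_bits(32))
```

The largest weight in the block always maps to ±6, so the full FP4 range is used, while small weights in a block dominated by a large one (0.05 here) can collapse to zero.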

Stochastic Rounding in FP4 Training

When training with FP4 weights, gradient updates are tiny. Deterministic rounding to the nearest FP4 value would kill learning. Stochastic rounding is essential:

1. Compute gradient g in FP16/BF16 (higher-precision backward pass)
2. Update: w' = w − α·g in FP16 (higher-precision accumulation)
3. Store: w_q = SR_FP4(w') (stochastically round to one of the two neighboring FP4 values)
4. Forward pass uses w_q (FP4 weights for fast matmuls)

The "master copy" in FP16 accumulates exact gradients. Only the forward-pass copy is quantized to FP4. This hybrid approach gets the speed of FP4 matmuls with the correctness of FP16 accumulation.
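Stochastic rounding carries over to a non-uniform grid by using the two neighboring levels and the gap between them in place of a fixed Δ. The following is a sketch under the assumption that the input has already been divided by its block scale; stochastic_round_fp4 is a hypothetical helper, not a hardware or library API.

```python
import bisect
import random

MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
# All 15 distinct signed values in ascending order (+0 and -0 coincide).
FP4_GRID = sorted({s * m for m in MAGNITUDES for s in (1.0, -1.0)})

def stochastic_round_fp4(v, rng):
    """Round v to a neighboring FP4 value, unbiased inside the range [-6, 6]."""
    v = max(FP4_GRID[0], min(FP4_GRID[-1], v))    # saturate out-of-range inputs
    i = bisect.bisect_right(FP4_GRID, v)          # index of first level above v
    if i == len(FP4_GRID):
        return FP4_GRID[-1]                       # v is exactly the top level
    lo, hi = FP4_GRID[i - 1], FP4_GRID[i]
    p_up = (v - lo) / (hi - lo)                   # position within the local gap
    return hi if rng.random() < p_up else lo

rng = random.Random(0)
vals = [stochastic_round_fp4(4.5, rng) for _ in range(100_000)]
print("mean of SR_FP4(4.5) ~", round(sum(vals) / len(vals), 3))
```

For 4.5 the neighbors are 4 and 6, so the value rounds up a quarter of the time and the mean should come out near 4.5. Saturation at ±6 is the one place where the rounding is no longer unbiased.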

NVFP4 Performance Numbers

On NVIDIA Blackwell (B200):

• FP4 matmul throughput: 2x that of FP8, 4x that of FP16

• Memory footprint: 4 bits/weight vs. 16 bits = 4x compression

• A 70B-parameter model fits in a single GPU (70B × 4 bits = 35 GB)

• With block scaling overhead: ~4.5 effective bits/weight

The combination of non-uniform spacing (FP4's exponential format), block scaling (local dynamic range adaptation), and stochastic rounding (unbiased training) makes 4-bit training viable. Without any one of these three components, quality degrades significantly.

Connection to NF4 (Lecture 4)

QLoRA's NF4 format (from Lecture 4's Lloyd-Max theory) uses a 4-bit codebook optimized for Gaussian-distributed weights. NVFP4 uses the standard floating-point format instead. The trade-off:

Format | Level spacing | Hardware | Use case
NF4 (QLoRA) | Lloyd-Max optimal for N(0,1) | Software lookup table | Inference only
NVFP4 | Exponential (floating-point) | Native GPU hardware | Training + inference

NF4 has slightly lower quantization error for Gaussian inputs, but NVFP4 runs in hardware at full speed. For training, the hardware speed advantage dominates.

FP4 Number Line

The 16 representable FP4 values on a number line. Notice the non-uniform spacing: denser near zero, sparser at extremes. Drag the input to see which FP4 value it quantizes to.

Why is stochastic rounding essential for FP4 training?

Chapter 6: Perceptual Quality

We've focused on mathematical properties: bias, variance, independence. But dithering was invented for a perceptual reason — making quantized signals look and sound better to humans. Let's understand why trading structured error for noise is a good deal.

The Masking Effect

Human perception is remarkably good at ignoring broadband noise but terrible at ignoring periodic patterns. Consider:

Audio: A uniform noise floor at −60 dB is inaudible. But a pure tone at −60 dB (a limit cycle from quantization) is clearly audible. The ear's frequency selectivity makes tonal artifacts 20-30 dB more offensive than noise of the same power.

Images: Film grain (random noise) is unobtrusive and even aesthetically pleasing. But banding (contour lines from quantization) is immediately ugly. The visual system's edge-detection circuitry highlights artificial sharp boundaries that don't exist in the scene.

The perceptual trade-off: Dithering increases total error energy slightly (non-subtractive case) or keeps it the same (subtractive case), but converts ALL of that energy from structured to random form. Since structured artifacts are 20-30 dB more perceptually offensive, this is a massive net win.

Types of Dither Noise

Pilanci distinguishes by the probability density of the dither signal:

Rectangular (RPDF): Uniform on [−Δ/2, Δ/2]. This is "first-order" dither. It ensures the first moment (mean) of the error is independent of the input, but the second moment (variance) may still depend on x.

Triangular (TPDF): The sum of two independent rectangular dithers. Has a triangular distribution on [−Δ, Δ]. This ensures both the first AND second moments of the error are independent of the input. Standard in audio CD mastering.

Gaussian: N(0, σ²) with σ ≈ Δ/3. Provides even smoother decorrelation but with unbounded support (can push signal into overload). Used in some imaging applications.
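The difference between first-order and second-order dither shows up when the mean and variance of the total error Q(x + d) − x are computed for several fixed inputs. The sketch below does this for non-subtractive RPDF and TPDF dither by averaging over a grid of dither values (a numerical approximation, with grid size chosen for illustration); error_moments is a hypothetical helper.

```python
import math

def quantize(z, delta=1.0):
    return delta * math.floor(z / delta + 0.5)

def error_moments(x, tpdf, delta=1.0, steps=401):
    """Mean and variance of Q(x + d) - x, where d is one uniform dither (RPDF)
    or the sum of two independent uniform dithers (TPDF)."""
    grid = [-delta / 2 + (k + 0.5) * delta / steps for k in range(steps)]
    second = grid if tpdf else [0.0]
    total = total_sq = 0.0
    for d1 in grid:
        for d2 in second:
            e = quantize(x + d1 + d2, delta) - x
            total += e
            total_sq += e * e
    n = len(grid) * len(second)
    mean = total / n
    return mean, total_sq / n - mean * mean

for x in (0.0, 0.25, 0.4):
    m_r, v_r = error_moments(x, tpdf=False)
    m_t, v_t = error_moments(x, tpdf=True)
    print(x, "RPDF var:", round(v_r, 3), " TPDF var:", round(v_t, 3))
```

With RPDF the variance should move with x (zero on a level, larger toward the middle of a bin), while with TPDF it should stay near Δ²/4 for every input. That constancy is what removes noise modulation.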

TPDF Dither: Why Audio Engineers Use It

For 16-bit audio (CD quality), Δ = 2/65536 ≈ 3×10⁻⁵. When mastering a 24-bit recording down to 16-bit, the last 8 bits are lost. Without dither, quiet passages develop audible quantization distortion with harmonic content. With TPDF dither:

• The dither adds Δ²/6 of noise power, raising the total from Δ²/12 to Δ²/4 (about 4.8 dB, to roughly −91 dB), which is below the threshold of hearing

• All signal-correlated distortion products are eliminated

• The result sounds "analog" rather than "digital"

Shaped Dither

Going further: noise shaping moves the dither energy into frequency bands where human perception is least sensitive (above 15 kHz for audio, in smooth regions for images). The total noise power increases, but the perceived noise decreases. This is the final refinement in professional audio mastering.

The Quantitative Case

Consider reducing 24-bit audio to 16-bit. Without dither:

• Dynamic range: 96 dB (16 bits × 6.02 dB/bit)

• Quantization noise is correlated with the signal → audible as distortion

• Below −70 dB, harmonics and intermodulation products are visible in spectrogram

With TPDF dither:

• Noise floor rises ~4.8 dB (from −96 dB to about −91 dB), still inaudible

• ALL signal-correlated distortion products eliminated

• Clean spectrogram even at −100 dB: just flat noise floor

With shaped dither (e.g., Pow-R Type 3):

• Noise floor at 4 kHz (where hearing is most sensitive): −110 dB

• Noise floor at 16–20 kHz (where hearing is weakest): −80 dB

• Total noise power is HIGHER than TPDF, but perceived noise is LOWER

• Effective dynamic range: ~120 dB in the sensitive band (exceeding 16-bit nominal!)

Dithering in Image Processing

The visual analog is error diffusion (Floyd-Steinberg dithering). When reducing color depth (e.g., 24-bit to 8-bit palette), the quantization error at each pixel is distributed to neighboring pixels. This spreads the error spatially, converting visible banding into fine-grained texture. The result looks dramatically better despite identical total error energy.

The principle is the same: structured error (visible contours) is far more offensive than noise-like error (invisible texture). Any technique that randomizes or diffuses the error pattern improves perceptual quality without reducing mathematical MSE.
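Error diffusion is short enough to write out in full. This is a minimal sketch of Floyd-Steinberg with its standard 7/16, 3/16, 5/16, 1/16 weights, applied to a synthetic horizontal gradient; the image size and the list-of-lists representation are assumptions made to keep the example dependency-free.

```python
import math

def floyd_steinberg(img, levels=2):
    """Quantize a grayscale image (rows of floats in [0, 1]) to `levels` gray
    values, pushing each pixel's error onto not-yet-visited neighbours."""
    h, w = len(img), len(img[0])
    buf = [row[:] for row in img]                 # working copy that absorbs error
    out = [[0.0] * w for _ in range(h)]
    step = 1.0 / (levels - 1)
    for y in range(h):
        for x in range(w):
            old = buf[y][x]
            new = min(1.0, max(0.0, step * math.floor(old / step + 0.5)))
            out[y][x] = new
            err = old - new
            if x + 1 < w:
                buf[y][x + 1] += err * 7 / 16          # right
            if y + 1 < h:
                if x > 0:
                    buf[y + 1][x - 1] += err * 3 / 16  # below left
                buf[y + 1][x] += err * 5 / 16          # below
                if x + 1 < w:
                    buf[y + 1][x + 1] += err * 1 / 16  # below right
    return out

gradient = [[x / 63 for x in range(64)] for _ in range(16)]
out = floyd_steinberg(gradient, levels=2)
mean_in = sum(map(sum, gradient)) / (16 * 64)
mean_out = sum(map(sum, out)) / (16 * 64)
print("mean brightness in/out:", round(mean_in, 3), round(mean_out, 3))
for row in out[:4]:
    print("".join("#" if v else "." for v in row))
```

Every output pixel is pure black or white, yet the density of white pixels rises smoothly from left to right, so average brightness is preserved apart from a small amount of error lost at the image borders.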

Gradient Image: Banding vs. Dithered

A simulated gradient quantized to few levels. Toggle dither to see banding disappear. Adjust bits to control severity.

Triangular (TPDF) dither ensures independence of which moments of the error from the input?

Chapter 7: Mastery

Let's consolidate. Dithering and stochastic rounding are two faces of the same idea: randomization converts deterministic quantization error into statistically well-behaved noise.

Summary Table

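Method | Bias | Noise Power | Error ⊥ Input? | Use Case
Deterministic rounding | Biased (whenever x is off the grid) | Δ²/12 (uniform input) | No (sawtooth) | Simple DAC/ADC
Non-subtractive dither (RPDF) | Unbiased | f(1−f)Δ² ≤ Δ²/4, average Δ²/6 | Mean only | Audio/image display
Subtractive dither | Unbiased | = Δ²/12 | Yes | Communications, DSP
Stochastic rounding | Unbiased (exactly) | f(1−f)Δ² ≤ Δ²/4 | Mean only | ML training

Here f is the fractional position of x within its quantization bin.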

Connections

Lecture 3 (Quantization Noise): We assumed the error was white and uniform — dithering is what makes that assumption true.

Lecture 4 (Lloyd-Max): Non-uniform quantization + stochastic rounding = modern low-precision ML (NF4 in QLoRA uses a non-uniform 4-bit codebook in the spirit of Lecture 4; stochastic rounding from this lecture handles the low-precision updates in training).

Lecture 6 (DFT): The spectral analysis of quantization error — dithering whitens the error spectrum, making it flat across all frequencies instead of concentrated at specific tones.

Upcoming (Rate-Distortion): Dithering connects to the random coding argument in Shannon's rate-distortion theorem — randomization achieves the theoretical limit.

Open Research Questions

Optimal dither for non-uniform quantizers: The linearization theorem assumes uniform (rounding) quantizers. For Lloyd-Max or learned codebooks, what dither distribution is optimal? Active research area.

Dither in vector quantization: Extending dither to multi-dimensional quantizers (used in VQ-VAE, product quantization) is non-trivial. The "one bin width" intuition doesn't generalize cleanly to high dimensions.

Stochastic rounding hardware: Implementing true stochastic rounding requires a random number generator per arithmetic unit. Current GPUs approximate this with deterministic pseudo-random sequences. The impact on training quality is still being studied.

Adaptive stochastic rounding: Can we do better than uniform probability? Recent work (Hou et al., 2023) explores "temperature-scaled" stochastic rounding where the probability curve is steeper near 0 and 1, reducing variance while maintaining unbiasedness.

Pilanci's Key Takeaways

From the lecture slides, three messages stand out:

1. Randomization is a tool, not a weakness. Adding noise seems like corruption, but it's actually regularization. The noise breaks pathological patterns that deterministic systems get stuck in.

2. Unbiasedness enables accumulation. In iterative algorithms (gradient descent, sigma-delta modulation, adaptive filters), unbiased errors average out over time. Biased errors accumulate and grow. This is why stochastic rounding works for training: each step is noisy, but the trajectory tracks the true optimum.

3. The right trade-off depends on the application. For display (images, audio), non-subtractive dither suffices — the extra noise is imperceptible. For precise computation (ML training), stochastic rounding gives mathematical guarantees. For communication systems, subtractive dither achieves minimum noise with shared randomness. No single technique dominates all contexts.

Pilanci's message: "Quantization is fundamentally lossy. You can't eliminate the error. But you can control its character. Dithering converts adversarial, structured error into friendly, noise-like error. Stochastic rounding ensures the error is unbiased. Together, they make aggressive quantization viable in modern ML."
Which technique is essential for preserving small gradients during FP4/FP8 training?

Worked Comparison: 8-bit Audio Quantization

Consider quantizing a 16-bit audio signal to 8 bits (Δ = 256 LSBs of the original):

Method | Noise Power | Perceptual Quality
Truncation (drop 8 LSBs) | Δ²/12 ≈ 5461 variance, plus a DC offset of Δ/2 | Terrible: DC bias + harmonics
Rounding (no dither) | Δ²/12 ≈ 5461 | Bad: no bias but tonal artifacts
RPDF dither | Δ²/6 ≈ 10923 | Good: noise-like error, +3 dB floor
TPDF dither | Δ²/4 = 16384 | Best: completely decorrelated error
Subtractive RPDF | Δ²/12 ≈ 5461 | Best: lowest noise + decorrelated

TPDF has higher noise than RPDF (about 1.8 dB more, or 4.8 dB above undithered rounding), but the perceptual improvement from full second-moment independence outweighs the penalty for most applications. This is why CD mastering commonly uses TPDF.

The Connection to Information Theory

Dithering has a deep connection to Shannon's rate-distortion theory. The random coding argument in Shannon's achievability proof essentially says: "randomize before encoding, then the codebook errors become noise-like." Dithering is the practical embodiment of this: randomize (add d), quantize (encode), and the resulting distortion has the best possible statistical properties.

In fact, entropy-coded dithered scalar quantization comes within about 0.254 bits per sample (1.53 dB) of the rate-distortion bound, and dithered lattice quantizers close that gap as the dimension grows. This isn't a coincidence: dithering creates the independence and uniformity conditions that let the quantizer be analyzed as an additive-noise channel.

Final thought: "The combination of noise and nonlinearity does not result in a loss of information, but rather in a change of the form in which information is represented." — Lawrence Roberts, 1962 (MIT thesis on dithered quantization in images)