Changing basis to reveal hidden structure — from shifted deltas to complex exponentials, and O(N log N) computation.
You have a signal: 1024 audio samples, or 256 pixel values in a row of an image, or 512 sensor readings. In the time domain (or spatial domain), this is just a list of numbers. Each number tells you "what happened at this instant." Useful, but limited.
Here's the problem: in the time domain, many signals that look completely different share deep structure. A pure 440 Hz sine wave and a 440 Hz sine wave shifted by half a period look totally different sample-by-sample — one starts positive, the other negative. But they're "the same note." The structure (single frequency, 440 Hz) is invisible in the time representation.
The fix: represent the signal in a different basis. Instead of "how much of each shifted delta is in this signal," ask "how much of each frequency is in this signal." That's the Fourier basis.
The standard basis for N-dimensional signal space is the set of shifted deltas:
Any signal x can be written as x = x[0]·e0 + x[1]·e1 + ... + x[N−1]·eN−1. The coefficients are just the signal samples themselves. This representation is trivial — it tells us nothing new.
The Fourier transform says: represent x in a different basis, where the basis vectors are complex exponentials (sinusoids). The coefficients in this new basis are the frequency components.
A signal made of two sinusoids. In the time domain, it's hard to see the components. In the frequency domain, they're obvious: two spikes.
Before we can define the Fourier basis, we need to be precise about what makes a "good" basis. The answer: orthonormality.
For two N-dimensional complex vectors u, v, the inner product is:
where v[n]* is the complex conjugate. The conjugate is essential for complex vectors — without it, ⟨v, v⟩ wouldn't give a real positive number.
A set of vectors {v0, ..., vM−1} is linearly independent if no vector can be written as a combination of the others. Equivalently: the only solution to c0v0 + c1v1 + ... = 0 is all ck = 0.
If we have N linearly independent vectors in N-dimensional space, they form a basis — any vector can be uniquely decomposed into this basis.
A set of vectors is orthogonal if every pair has zero inner product:
It's orthonormal if additionally each vector has unit norm:
Given an orthonormal basis {v0, ..., vN−1}, any vector x can be written:
This is analysis (finding coefficients) and synthesis (reconstructing from coefficients). The DFT is exactly this, with the Fourier vectors as the basis.
Before orthonormal bases, decomposing a signal required solving a system of N equations in N unknowns — O(N3) computation for Gaussian elimination. With an orthonormal basis, each coefficient is computed independently via a single inner product — O(N) per coefficient, O(N2) total for all N coefficients, or O(N log N) with the FFT.
Moreover, truncating the expansion (keeping only the largest coefficients) gives the best approximation in the least-squares sense. This is the foundation of transform coding: DFT/DCT the signal, keep the big coefficients, throw away the small ones. The orthonormality guarantees this is optimal.
Think of the standard basis e0, e1 in 2D. They're orthonormal: perpendicular, unit length. To find the x-component of a point, you project onto e0. To find the y-component, project onto e1. The projections are inner products.
The Fourier basis does the same thing in N dimensions, but the basis vectors are complex exponentials (spinning phasors) instead of coordinate axes. Projecting onto each frequency basis vector gives you "how much of that frequency is present."
Without orthogonality, decomposition becomes computationally expensive and numerically unstable. Consider a non-orthogonal basis {v0, v1}. To find the coefficients c0, c1 such that x = c0v0 + c1v1, you must solve a linear system — invert the Gram matrix G = [⟨vi, vj⟩]. For N dimensions, that's O(N3) for a general basis, vs. O(N) for an orthonormal one (just compute N inner products).
Moreover, with a non-orthogonal basis, small errors in the coefficients can cause large errors in the reconstruction (the condition number of G may be huge). Orthonormal bases have condition number exactly 1 — perfect numerical stability.
A crucial consequence of orthonormality: the "length" (energy) of a vector is the same whether measured in the original coordinates or the expansion coefficients:
No energy is created or destroyed by changing to an orthonormal basis. This guarantees that quantizing in the transform domain causes the same total distortion as quantizing in the time domain (up to how coefficients are allocated). This is the mathematical foundation of transform coding (JPEG, MP3).
Drag the vector x. Its projections onto the two orthogonal basis vectors (orange, teal) are the coordinates.
Now the main event. We define N special vectors that form an orthonormal basis for CN. These are the Fourier basis vectors (also called DFT basis vectors or frequency-domain basis vectors).
For m = 0, 1, ..., N−1, the m-th Fourier basis vector is:
Each wm is a sampled complex exponential at frequency m cycles per N samples, normalized by 1/√N to have unit norm.
Expanding ejθ = cosθ + j sinθ:
So wm is a complex sinusoid that completes exactly m full cycles over N samples.
Define WN = e−j2π/N (the "N-th root of unity"). Then all DFT computations use powers of this single complex number:
For N = 8: W8 = e−jπ/4 = cos(45°) − j sin(45°) = (√2/2)(1 − j). All 64 entries of the 8×8 DFT matrix are powers of this single number. The FFT exploits the algebraic structure of these powers (periodicity, symmetry) to avoid redundant computation.
Key identities:
• WNN = 1 (periodicity)
• WNN/2 = −1 (half-period gives negation)
• WNk+N = WNk (periodicity in k)
• W2N2k = WNk (decimation property — the FFT uses this)
We need to show ⟨wk, wr⟩ = δ[k−r]. Compute:
Let ω = ej2π(k−r)/N. Then the sum becomes (1/N) ∑n=0N−1 ωn.
Case 1: k = r. Then ω = e0 = 1. Sum = (1/N) · N = 1. Correct: the norm is 1.
Case 2: k ≠ r. Then ω ≠ 1, and we use the geometric series formula:
Now ωN = ej2π(k−r) = 1 (since k−r is an integer). So the numerator is 1 − 1 = 0. The sum is zero.
For N = 8, the basis vectors are 8 complex exponentials:
• w0: constant (DC) — all samples equal 1/√8
• w1: one full cycle over 8 samples
• w2: two full cycles over 8 samples
• w3: three full cycles...
• w4: four full cycles (Nyquist frequency — alternating ±1)
Select a frequency index m to see the real and imaginary parts of wm. Note how m cycles fit in N samples.
Now we apply the orthonormal basis machinery. "Analysis" (find coefficients) gives us the DFT. "Synthesis" (reconstruct from coefficients) gives us the IDFT.
Given a signal x[n] of length N, its DFT coefficients are:
Each X[k] is the inner product of x with the k-th Fourier basis vector (without the 1/√N normalization — by convention, the DFT absorbs both 1/√N factors into the IDFT as a single 1/N).
X[k] is complex. Its magnitude |X[k]| tells you the amplitude of frequency k. Its phase ∠X[k] tells you the phase offset of frequency k.
Bin k = 0 is the DC component: X[0] = ∑ x[n] = N × (mean of x). It measures the average value of the signal.
Bin k = 1 represents one full cycle over N samples, i.e., frequency f1 = fs/N Hz. For N = 1024 and fs = 44100 Hz, this is 43.07 Hz.
Bin k represents frequency fk = k · fs/N Hz. The frequency resolution (spacing between bins) is Δf = fs/N. To resolve two frequencies separated by Δf Hz, you need at least N = fs/Δf samples.
Bins k = N/2 + 1 to N − 1 are the "negative frequencies" (or equivalently, frequencies above Nyquist). For real signals, they're redundant: X[N−k] = X[k]*. You only use bins 0 to N/2.
To reconstruct x from its DFT coefficients:
This says: the signal is the sum of N complex exponentials, each weighted by its DFT coefficient X[k], normalized by 1/N.
Both operations can be written as matrix-vector products. Define the DFT matrix F of size N×N:
Then:
where FH is the conjugate transpose (Hermitian) of F. Since F is N√N times a unitary matrix: FHF = N·I. Equivalently, (1/√N)F is unitary.
Let W = e−j2π/4 = e−jπ/2 = −j. The DFT matrix for N=4 is:
Substituting W = −j, W2 = −1, W3 = j:
You can verify: every row is orthogonal to every other row (inner product = 0). Each row has norm √4 = 2.
Row 0 = [1, 1, 1, 1]. Row 1 = [1, −j, −1, j]. Inner product:
Row 0 vs. Row 2: ⟨[1,1,1,1], [1,−1,1,−1]⟩ = 1 − 1 + 1 − 1 = 0 ✓
Row 1 norm: |1|2 + |−j|2 + |−1|2 + |j|2 = 1 + 1 + 1 + 1 = 4 = N. So (1/√N)·row has unit norm.
The DFT matrix: each row k is a complex exponential at frequency k. Color = phase, brightness = magnitude (all magnitudes are 1).
The DFT has elegant algebraic properties that make it far more than a decomposition tool. These properties are why Fourier methods dominate signal processing.
The DFT of a sum is the sum of DFTs. Obvious from linearity of summation, but powerful: analyze complex signals by analyzing their components separately.
If x[n] has DFT X[k], then the circularly shifted signal x[(n−m) mod N] has DFT:
A time delay of m samples multiplies each frequency coefficient by a linear phase factor. The magnitude spectrum is unchanged — only the phase shifts. This is why |X[k]| represents the "frequency content" regardless of where the signal starts.
Multiplying the signal by a complex exponential shifts its spectrum:
This is the dual of the time-shift property. Modulation in time = shift in frequency.
Total energy in time = total energy in frequency (up to the 1/N factor from our normalization convention). The DFT is an energy-preserving transformation — it rotates the vector, never stretches or shrinks it.
where ⊛ denotes circular convolution. Convolution in time becomes multiplication in frequency. This is the single most important DFT property for computation:
• Direct convolution: O(N2) multiplications
• Via FFT: DFT both (O(N log N)), multiply pointwise (O(N)), IDFT (O(N log N)). Total: O(N log N)
Suppose you want to apply a 1000-tap FIR filter h[n] to a signal x[n] of length 100,000:
• Direct convolution: 1000 × 100,000 = 108 multiplies
• FFT method: Zero-pad both to N = 131,072 (next power of 2 ≥ 100,000 + 999). Three FFTs of size N: 3 × N·log2(N) ≈ 3 × 131,072 × 17 ≈ 6.7 × 106 operations. Plus N pointwise multiplies. Total ≈ 7 × 106.
• Speedup: 14×. For longer filters, the speedup is even larger.
In audio, convolution reverb (applying a room impulse response of 1–3 seconds = 44,100–132,300 taps) is ONLY feasible via FFT. Direct convolution would require billions of operations per second.
The DFT properties exhibit a beautiful duality: whatever holds for the time domain also holds (with a sign flip or N factor) for the frequency domain:
• Convolution in time ↔ pointwise multiplication in frequency
• Pointwise multiplication in time ↔ circular convolution in frequency (divided by N)
• Shift in time ↔ phase ramp in frequency
• Phase ramp in time ↔ shift in frequency
This duality comes from the symmetric structure of the DFT and IDFT formulas (they differ only by the sign in the exponent and the 1/N factor).
| Property | Time Domain | Frequency Domain |
|---|---|---|
| Linearity | ax + by | aX + bY |
| Time shift | x[(n-m) mod N] | X[k]·e-j2πkm/N |
| Freq shift | x[n]·ej2πm0n/N | X[(k-m0) mod N] |
| Convolution | x ⊛ y | X · Y (pointwise) |
| Parseval | ∑|x[n]|2 | (1/N)∑|X[k]|2 |
Two short signals are convolved. Compare the direct computation (blue) with the FFT approach (orange overlay). They match exactly.
The DFT formula requires N multiplications for each of N coefficients: O(N2) total. For N = 106 (one second of audio at 1 MHz), that's 1012 operations. Impractical.
The Fast Fourier Transform (FFT) computes the same result in O(N log N). For N = 106, that's 2×107 — a speedup of 50,000×. The FFT is one of the most important algorithms in computational science.
Start with the DFT of an N-point signal (N = 2m, a power of 2):
Split the sum into even-indexed and odd-indexed terms:
Note that WN2 = e−j2π·2/N = e−j2π/(N/2) = WN/2. So:
Define A[k] = DFT of even samples, B[k] = DFT of odd samples (both length N/2):
One N-point DFT becomes two N/2-point DFTs plus N/2 butterflies. Apply recursively:
For N = 8: direct DFT = 64 multiplies. FFT = 8·log2(8) = 24 multiplies. For N = 1024: direct = 1,048,576. FFT = 10,240. Ratio: 102×.
The FFT can be drawn as a network of butterflies, organized in log2(N) stages. Each stage has N/2 butterflies. The input order is bit-reversed.
Step through the FFT computation stage by stage. Watch how 8 inputs are combined through 3 stages of butterflies to produce 8 frequency coefficients.
The key algebraic identity is WNk+N/2 = −WNk. This is because:
So X[k + N/2] reuses the SAME products A[k] and WNk·B[k] computed for X[k], just with a sign flip. This "buy one, get one free" is the source of the O(N log N) savings — we compute N/2 multiplications per stage, and there are log2(N) stages.
The FFT can be computed in-place: using only O(N) memory (the input array itself). At each stage, each butterfly reads two values and writes two values back to the same locations. This is why the FFT is so practical — no auxiliary storage beyond the input buffer.
The input to the FFT butterfly network is in bit-reversed order. For N = 8 (3 bits):
| Index | Binary | Reversed | Input position |
|---|---|---|---|
| 0 | 000 | 000 | x[0] |
| 1 | 001 | 100 | x[4] |
| 2 | 010 | 010 | x[2] |
| 3 | 011 | 110 | x[6] |
| 4 | 100 | 001 | x[1] |
| 5 | 101 | 101 | x[5] |
| 6 | 110 | 011 | x[3] |
| 7 | 111 | 111 | x[7] |
This reordering groups even-indexed and odd-indexed elements recursively, which is exactly what the divide-and-conquer decomposition needs.
The radix-2 Cooley-Tukey FFT requires N to be a power of 2. Extensions exist:
• Radix-4: Split into 4 sub-problems. Slightly fewer multiplies than radix-2 (saves ~25%).
• Split-radix: Combines radix-2 and radix-4 optimally. The best known algorithm for power-of-2 sizes.
• Mixed-radix: Handles any N = n1·n2·...·nk. FFTW library uses this.
• Bluestein (chirp-z): Reduces arbitrary-size DFT to a convolution, then uses FFT. Works for ANY N.
In practice, libraries like FFTW ("Fastest Fourier Transform in the West") choose the optimal algorithm automatically based on N and hardware. You just call `fft(x)` and get O(N log N) performance regardless of N.
| N | Direct DFT (N2) | FFT (N log2 N) | Speedup |
|---|---|---|---|
| 8 | 64 | 24 | 2.7× |
| 64 | 4,096 | 384 | 10.7× |
| 1,024 | 1,048,576 | 10,240 | 102× |
| 65,536 | 4.3×109 | 1.05×106 | 4,096× |
| 1,048,576 | 1.1×1012 | 2.1×107 | 52,429× |
At audio sample rates (44.1 kHz), processing 1 second requires N = 44,100. The FFT makes real-time spectral analysis possible on even modest hardware.
python import numpy as np # Direct DFT (O(N^2) — for understanding only) def dft_direct(x): N = len(x) X = np.zeros(N, dtype=complex) for k in range(N): for n in range(N): X[k] += x[n] * np.exp(-1j * 2 * np.pi * k * n / N) return X # FFT via numpy (O(N log N) — use this) X = np.fft.fft(x) # DFT x_back = np.fft.ifft(X) # IDFT mags = np.abs(X) # magnitude spectrum phases = np.angle(X) # phase spectrum
Let's compute DFTs by hand for small N. This builds intuition for what the DFT actually does to simple signals.
X[0] = 1 + 1 + 1 + 1 = 4
X[1] = 1 + W + W2 + W3 = 1 + (−j) + (−1) + j = 0
X[2] = 1 + W2 + W4 + W6 = 1 + (−1) + 1 + (−1) = 0
X[3] = 1 + W3 + W6 + W9 = 1 + j + (−1) + (−j) = 0
Result: X = [4, 0, 0, 0]. A constant signal has all its energy at DC (frequency 0). Makes sense — no oscillation means no non-zero frequency components.
X[0] = 1 + (−1) + 1 + (−1) = 0
X[1] = 1 + (−1)(−j) + 1(−1) + (−1)(j) = 1 + j − 1 − j = 0
X[2] = 1 + (−1)(−1) + 1(1) + (−1)(−1) = 1 + 1 + 1 + 1 = 4
X[3] = 1 + (−1)(j) + 1(−1) + (−1)(−j) = 1 − j − 1 + j = 0
Result: X = [0, 0, 4, 0]. All energy at k=2, the Nyquist frequency (N/2 = 2 cycles per N samples = maximum oscillation rate). An alternating signal IS the Nyquist frequency.
This is cos(2πn/4) = cos(πn/2) = [1, 0, −1, 0].
X[0] = 1 + 0 + (−1) + 0 = 0
X[1] = 1 + 0·(−j) + (−1)(−1) + 0·j = 1 + 1 = 2
X[2] = 1 + 0·(−1) + (−1)·1 + 0·(−1) = 0
X[3] = 1 + 0·j + (−1)(−1) + 0·(−j) = 1 + 1 = 2
Result: X = [0, 2, 0, 2]. Energy at k=1 and k=3 (= N−1). This is the "mirror symmetry" for real signals: X[N−k] = X[k]* (conjugate). Since cos = (ejθ + e−jθ)/2, it has both positive and negative frequency components.
X[0] = 1 + 2 + 3 + 4 = 10 (the sum — DC value)
X[1] = 1 + 2(−j) + 3(−1) + 4(j) = 1 − 2j − 3 + 4j = −2 + 2j
X[2] = 1 + 2(−1) + 3(1) + 4(−1) = 1 − 2 + 3 − 4 = −2
X[3] = 1 + 2(j) + 3(−1) + 4(−j) = 1 + 2j − 3 − 4j = −2 − 2j
Check conjugate symmetry: X[3] = X[4−1] should equal X[1]* = (−2 + 2j)* = −2 − 2j. Confirmed!
Magnitudes: |X| = [10, 2√2, 2, 2√2] ≈ [10, 2.83, 2, 2.83]. The DC component dominates because the ramp has a large mean value (2.5). The non-zero frequency components reflect the linear trend.
Time-domain energy: 12 + 22 + 32 + 42 = 30.
Frequency-domain: (1/4)(102 + 8 + 4 + 8) = (1/4)(120) = 30. Checks out!
Enter 4 real values and see the DFT computed step by step.
The DFT is the workhorse of digital signal processing. Let's consolidate the key ideas and connect forward.
| Concept | Formula | Complexity |
|---|---|---|
| DFT (analysis) | X[k] = ∑ x[n]e-j2πkn/N | O(N2) direct |
| IDFT (synthesis) | x[n] = (1/N)∑ X[k]ej2πkn/N | O(N2) direct |
| FFT | Cooley-Tukey divide & conquer | O(N log N) |
| Orthogonality | ∑ ej2π(k-r)n/N = Nδ[k-r] | — |
• Lecture 7 (Spectral Descriptors): We'll use DFT magnitudes to extract audio features like spectral centroid, spread, and entropy.
• Lecture 8+ (Compression): The DFT (and its cousin the DCT) enable transform coding — JPEG, MP3, video codecs. Energy compaction in the frequency domain allows aggressive quantization of "unimportant" coefficients.
• ML Connection: Self-attention in transformers can be viewed as computing pairwise correlations — related to Fourier-domain filtering. FNet (2021) replaced attention with FFT and achieved 92% of BERT's accuracy with 7× faster training.
The continuous Fourier Transform was discovered by Joseph Fourier in 1807 (to solve the heat equation). The DFT was known to Gauss in 1805, but remained computationally impractical until Cooley and Tukey published their FFT algorithm in 1965. Within a decade:
• 1966: First real-time spectrum analyzers using FFT
• 1969: FFT used in X-ray crystallography (determining protein structures)
• 1970s: FFT enables digital audio processing (synthesizers, effects)
• 1980s: FFT in image compression (precursor to JPEG)
• 1990s: MP3 compression, OFDM in wireless communications
• 2000s+: 4G/5G OFDM, WiFi, convolution in neural networks
Today, FFT operations are so fundamental that many processors have dedicated FFT hardware (DSP chips, neural accelerators with FFT units).
• Circular (not linear) convolution: The DFT computes circular convolution. For linear convolution, zero-pad both signals to length ≥ Nx + Nh − 1.
• Spectral leakage: A finite-length signal appears to have energy at all frequencies (the DFT of a windowed sinusoid isn't a perfect delta). Windowing (Hann, Hamming) reduces leakage at the cost of frequency resolution.
• Assumes periodicity: The DFT treats the signal as periodic with period N. Discontinuities at the boundaries (first sample ≠ last sample) create leakage. Windowing mitigates this.
• Fixed resolution: Frequency bins are spaced fs/N Hz apart. To resolve two close frequencies, you need a long window (large N). This trades time resolution for frequency resolution — the uncertainty principle.
The DFT appears surprisingly often in modern ML:
• FNet (2021): Replaced attention in transformers with a 2D DFT. Achieved 92% of BERT's accuracy with 7× faster training. The DFT captures token mixing without learned parameters.
• Spectral normalization: Used in GAN training to constrain the spectral norm (largest singular value) of weight matrices. The FFT speeds up certain norm computations.
• Audio models: Whisper, WaveNet, and all speech models use STFT (Short-Time Fourier Transform) as the input representation. The model sees spectrograms, not raw waveforms.
• Efficient convolutions: For very long sequences (>4096 tokens), some architectures replace attention with FFT-based convolution (e.g., Hyena, S4). This gives O(N log N) scaling vs. O(N2) for attention.
• Positional encoding: The sinusoidal positional encoding in the original Transformer (Vaswani et al., 2017) uses the same complex exponential frequencies as the DFT basis. Each position gets a unique "frequency fingerprint."
The DFT of a full signal gives frequency content but loses ALL time information. Which frequencies were present when? The Short-Time Fourier Transform solves this by computing the DFT on short overlapping frames:
where w[n] is a window function (Hann, Hamming) centered at time t. The result is a 2D time-frequency representation called a spectrogram. This is the input to most modern audio ML models (Whisper, AudioLM, MusicGen).
Frame parameters:
• Frame length: 25 ms for speech, 23 ms for music (1024 samples at 44.1 kHz)
• Hop size: 10 ms (256 samples) — 75% overlap between consecutive frames
• FFT size: often zero-padded to 2048 or 4096 for finer frequency resolution
• Window: Hann is standard (good sidelobe suppression)
The spectrogram is the bridge between the DFT (this lecture) and spectral descriptors (next lecture). Each column of the spectrogram is one DFT frame; spectral descriptors summarize each column into a few numbers.
The DFT is one of many orthonormal transforms. How does it compare?
| Transform | Basis Functions | Best For | Used In |
|---|---|---|---|
| DFT | Complex exponentials | Stationary signals, convolution | Audio processing, communications |
| DCT | Cosines only (real-valued) | Natural images, speech | JPEG, MPEG, AAC |
| DWT | Scaled/shifted wavelets | Non-stationary signals, edges | JPEG2000, denoising |
| KLT | Data-dependent (eigenvectors) | Optimal for specific statistics | PCA, theoretical bound |
The DCT is preferred over DFT for compression because: (1) it's real-valued (no phase to store), (2) it handles boundary conditions better (implicit even symmetry avoids edge discontinuities), and (3) it achieves energy compaction close to the optimal KLT for most natural signals. The DFT remains preferred for filtering and convolution (via the convolution theorem).