From O(n²) polynomial multiplication to O(n log n) — the most important algorithm you've never implemented.
You have two polynomials. One represents the frequency response of a microphone. The other represents the frequency response of a room. To figure out what a listener actually hears, you need to multiply these polynomials together. The result is a new polynomial whose degree is the sum of the two input degrees.
Here is the trouble. If each polynomial has n terms, multiplying them the way you learned in algebra class — distributing every term against every other term — takes O(n²) operations. For a polynomial with a million terms (common in signal processing), that is a trillion multiplications. Your computer will take minutes. Hours if n grows further.
In 1965, James Cooley and John Tukey published a 5-page paper that changed the world. They showed how to multiply polynomials in O(n log n) time — an algorithm called the Fast Fourier Transform (FFT). For n = 1,000,000, that is 20 million operations instead of a trillion. A speedup of 50,000×. On a modern laptop that is the difference between a millisecond and several minutes.
The FFT does not just speed up polynomial multiplication. It transforms the problem into a domain where multiplication is trivial, then transforms back. This idea — change representation, do easy work, change back — is one of the most powerful strategies in all of algorithm design. It is the same principle behind taking logarithms to turn multiplication into addition, or using eigendecomposition to simplify matrix operations.
Let us see the problem concretely. Suppose A(x) = 3 + 2x + x² and B(x) = 1 + 4x. To multiply them:
Every term in A multiplied every term in B — that is 3 × 2 = 6 multiplications. For degree-n polynomials, the product has degree 2n, and computing each of the 2n+1 output coefficients requires summing up to n+1 products. Total work: O(n²).
To see why it is exactly O(n²), think about it as a grid. Write A's coefficients along one axis and B's along the other. Each cell in the grid is one multiplication. The product polynomial's k-th coefficient ck is the sum of all cells along the k-th anti-diagonal:
This sum is called a convolution — a word we will return to in Chapter 6. For now, the point is that computing all ck values requires visiting every cell in the grid: (n+1)×(m+1) cells for polynomials of degree n and m.
How bad is O(n²)? For a degree-100 polynomial (modest by real-world standards), that is 10,000 multiplications. For degree-10,000 (common in signal processing), 100 million. For degree-1,000,000 (not unusual in scientific computing), one trillion. Even at a billion operations per second, that is 1,000 seconds — over 16 minutes for a single polynomial multiplication.
The FFT reduces this to about 20 million operations for degree-1,000,000. At a billion operations per second, that is 0.02 seconds. From 16 minutes to 20 milliseconds. This is not a marginal improvement — it is the difference between "impossible in practice" and "instant."
The simulation below shows this process. Watch how the number of multiplications (the orange count) grows quadratically as you increase the polynomial degree.
Each cell in the grid represents one single-term multiplication. The total number of cells is (deg A + 1) × (deg B + 1). Watch it explode as degree grows.
The FFT's genius is a three-step strategy: (1) convert the polynomials to a different representation where multiplication is trivial, (2) multiply, (3) convert back. The conversion steps each take O(n log n), and the multiplication in the alternate representation takes O(n). Total: O(n log n).
That alternate representation is called the point-value form, and the conversion is the Discrete Fourier Transform. Over the next nine chapters, we will build the entire pipeline from scratch.
A polynomial of degree n is usually written as a list of coefficients. A(x) = a0 + a1x + a2x² + ... + anxn. We store it as a vector a = (a0, a1, ..., an). This is the coefficient representation. It has n+1 numbers and uniquely determines the polynomial.
For example, the polynomial 3 + 2x + x² is stored as [3, 2, 1]. The polynomial 5 - x³ is stored as [5, 0, 0, -1]. The index of each element is the power of x it multiplies. This is the representation you learned in algebra class, and it is by far the most common way to work with polynomials.
But there is another way. A polynomial of degree n is also uniquely determined by its values at n+1 distinct points. If you know that A(0) = 3, A(1) = 6, and A(2) = 11, that is enough to recover A(x) = 3 + 2x + x². Let us verify: A(0) = 3 + 0 + 0 = 3. A(1) = 3 + 2 + 1 = 6. A(2) = 3 + 4 + 4 = 11. Check.
This is the point-value representation: a set of pairs {(x0, y0), (x1, y1), ..., (xn, yn)} where yk = A(xk). Why does this work? Because n+1 distinct points uniquely determine a degree-n polynomial. This is the polynomial interpolation theorem: given any n+1 points with distinct x-values, there is exactly one polynomial of degree ≤ n that passes through all of them. The proof uses the Vandermonde matrix, which is nonsingular when all xk are distinct.
The recovery process — going from point-value form back to coefficient form — is called interpolation. Lagrange interpolation does this in O(n²) for arbitrary points. But for the special points we will use (roots of unity), the inverse FFT does it in O(n log n).
Let us be precise. If A is represented by {(x0, y0), ..., (xn, yn)} and B by {(x0, y'0), ..., (xn, y'n)} — evaluated at the same points x0, ..., xn — then C = A · B is simply {(x0, y0·y'0), ..., (xn, yn·y'n)}.
One subtlety: if A has degree na and B has degree nb, the product C has degree na + nb. So we need at least na + nb + 1 points to represent C. We evaluate both A and B at that many points before multiplying.
This feels almost too good to be true. Let us verify it. If A(x0) = 3 and B(x0) = 5, is it really true that (A · B)(x0) = 15? Yes! Polynomial multiplication produces a new polynomial C such that C(x) = A(x) · B(x) for all x. So evaluating C at x0 is literally A(x0) · B(x0). Point-value multiplication just uses this identity at each evaluation point.
The tradeoff is clear. In coefficient form: addition is O(n) (add corresponding coefficients), but multiplication is O(n²). In point-value form: multiplication is O(n) (multiply corresponding values), but converting between representations is the bottleneck.
| Operation | Coefficient Form | Point-Value Form |
|---|---|---|
| Addition | O(n) | O(n) |
| Multiplication | O(n²) | O(n) |
| Evaluation at one point | O(n) via Horner | O(n²) via interpolation |
The three-step strategy for fast polynomial multiplication is now clear:
The entire game is making steps 1 and 3 fast. Evaluating a polynomial at a single point takes O(n) by Horner's rule. Evaluating at 2n points naively takes O(n²) — no better than direct multiplication. The FFT evaluates at 2n cleverly chosen points in O(n log n).
What are those clever points? The complex roots of unity. That is the topic of the next chapter.
Before moving on, let us pin down exactly how evaluation works. Given A(x) = a0 + a1x + a2x² + a3x³, the naive approach computes each power of x separately — that is wasteful. Horner's rule rewrites the polynomial as:
Start from the innermost parenthesis and work outward: multiply by x, add the next coefficient, repeat. This uses exactly n multiplications and n additions — O(n) total. Horner's rule is optimal for single-point evaluation, but for multi-point evaluation (many different x values), we need a fundamentally different approach.
python def horner(coeffs, x): """Evaluate polynomial at x using Horner's method. O(n).""" result = 0 for a in reversed(coeffs): result = result * x + a return result # Example: A(x) = 3 + 2x + x² evaluated at x=5 print(horner([3, 2, 1], 5)) # 3 + 10 + 25 = 38
The same polynomial shown as coefficients (left) and as point-values (right). Drag the coefficient sliders to change the polynomial and watch the point-value pairs update instantly.
We need to evaluate polynomials at carefully chosen points so that the evaluation step can be done in O(n log n) instead of O(n²). The magic points are the complex roots of unity.
Why not just use simple points like 0, 1, 2, ..., n-1? Because those points have no special algebraic structure that we can exploit for speedup. The roots of unity have a very specific structure — they are symmetric, they recurse into smaller sets of roots, and they satisfy orthogonality conditions. All of this structure is what the FFT exploits.
If you have never worked with complex numbers, here is the one-minute version: a complex number z = a + bi has a real part a and an imaginary part b, where i = √(-1). You can plot z as a point (a, b) in the plane. The magnitude |z| = √(a²+b²) is the distance from the origin. Multiplication by i rotates a point 90° counterclockwise. Multiplication by eiθ rotates by angle θ.
The n-th roots of unity are the n complex numbers that satisfy the equation zn = 1. They are evenly spaced around the unit circle in the complex plane:
Here ωn = e2πi/n is the principal n-th root of unity — the first one counterclockwise from 1 on the unit circle. All other n-th roots are powers of ωn.
For example, the 4th roots of unity are ω40 = 1, ω41 = i, ω42 = -1, ω43 = -i. They sit at the four compass points of the unit circle: east, north, west, south. The 8th roots add four more points in between (northeast, northwest, southwest, southeast). The 16th roots add eight more, and so on — each doubling fills in the gaps.
Notice that 1 is always a root of unity (for any n), and so is -1 (when n is even). The roots always come in conjugate pairs: if a + bi is a root, so is a - bi. This is why real-valued inputs produce conjugate-symmetric DFT outputs.
Roots of unity have three properties that make the FFT possible. Without any one of them, the algorithm breaks.
The halving lemma is the engine of the FFT. When we split a polynomial into its even- and odd-indexed coefficients, the evaluation points square — and by the halving lemma, n evaluation points become n/2 evaluation points. That is the divide step that gives us O(n log n).
Let us compute the 8th roots concretely. ω8 = e2πi/8 = eπi/4 = cos(45°) + i·sin(45°) = (√2/2) + i(√2/2) ≈ 0.707 + 0.707i.
Now verify the halving lemma. Square each 8th root: (ω80)² = 1, (ω81)² = i, (ω82)² = -1, (ω83)² = -i. Those are exactly the 4th roots! And the pattern repeats: (ω84)² = 1, (ω85)² = i, etc. Eight points collapsed to four. Squaring those four gives two points (the 2nd roots: 1 and -1). One more squaring gives one point (the 1st root: 1). Three halvings from 8 to 1 — that is log2(8) = 3 recursion levels.
The n-th roots of unity are evenly spaced on the unit circle. Use the slider to change n. Notice how n=8 roots include all n=4 roots (halving lemma). Click a root to see its value.
Now we can define the evaluation step precisely. The Discrete Fourier Transform (DFT) of a coefficient vector a = (a0, a1, ..., an-1) is a new vector y = (y0, y1, ..., yn-1) where:
In words: yk is the polynomial A evaluated at ωnk. The DFT evaluates a polynomial at all n-th roots of unity simultaneously.
This is a matrix-vector product. We can write y = Fn · a where Fn is the DFT matrix:
For n = 4, with ω = ω4 = e2πi/4 = i, the DFT matrix has a beautiful structure. Every entry is a power of i, and since i4 = 1, there are only four possible values: 1, i, -1, -i. The matrix looks like:
The first row is all 1s (evaluating at ω0 = 1 just sums the coefficients). The first column is all 1s (every root raised to the 0th power is 1). Each subsequent row rotates faster around the unit circle.
Look at the structure more carefully. Row 0 has all entries equal to 1 — evaluating at x=1 just sums all coefficients. Row 1 cycles through all n-th roots once. Row 2 cycles through them twice (hitting every other root). Row k cycles through k times. The DFT matrix is a frequency analysis machine: each row "tests" a different frequency by correlating the input signal with a sinusoid of that frequency.
This is exactly what a tuning fork does. A tuning fork resonates (vibrates strongly) when it encounters its natural frequency in a sound wave, and stays silent otherwise. Each row of the DFT matrix is a mathematical tuning fork for a different frequency. The output value yk tells you how much frequency k is present in the input signal.
Computing y = Fn · a naively is an n × n matrix-vector multiply: O(n²). That is no faster than distributing terms. The FFT computes this same product in O(n log n) by exploiting the structure of Fn — specifically, the halving lemma.
Think of the DFT matrix as a change of basis. The coefficient vector a lives in the "time domain" — it describes the polynomial by its coefficients. The DFT output y lives in the "frequency domain" — it describes the same polynomial by its values at the roots of unity. The matrix Fn is the change-of-basis matrix between these two representations.
A key property: Fn is a unitary matrix (up to scaling). Its columns are orthogonal, and each column has the same norm. This means the DFT preserves energy: the sum of |ak|² equals (1/n) times the sum of |yk|². This is Parseval's theorem, and it is crucial in signal processing — the energy of a signal in the time domain equals its energy in the frequency domain.
Each cell shows ωnjk as a point on the unit circle. Row j, column k. The color encodes the angle: warm = positive real, teal = positive imaginary, purple = negative real, pink = negative imaginary. Toggle n to see how the matrix grows.
Let us compute the DFT of a = (1, 2, 3, 4) with n = 4. We need ω4 = e2πi/4 = i.
So DFT(1, 2, 3, 4) = (10, -2-2i, -2, -2+2i). Notice y3 is the complex conjugate of y1 — this always happens for real-valued inputs, and it means we only need to compute half the outputs in practice.
rfft) exploit this for 2× speedup and 2× memory savings.What do the DFT values mean? For the polynomial interpretation, yk = A(ωk) is simply the polynomial evaluated at the k-th root of unity. But for signal processing, the interpretation is richer:
| Element | Meaning |
|---|---|
| y0 | DC component — the sum (or average) of the signal. No oscillation. |
| y1 | Amplitude and phase of the fundamental frequency — one cycle per signal period. |
| yk | Amplitude and phase of the k-th harmonic — k cycles per signal period. |
| yn/2 | Nyquist frequency — the highest frequency representable at this sampling rate. |
The magnitude |yk| tells you how strong frequency k is. The angle arg(yk) tells you the phase shift of that frequency component. Together, they give the complete frequency decomposition of the signal.
This is the heart of the chapter. We want to compute the DFT in O(n log n) instead of O(n²). The trick is divide and conquer on the polynomial itself.
The key question: how do we split the polynomial? We cannot split by degree (first half / second half of coefficients) because that does not create two independent subproblems at the same evaluation points. Instead, we split by parity of index: even-indexed coefficients go to one subproblem, odd-indexed to the other.
Given A(x) = a0 + a1x + a2x² + a3x³ + ... + an-1xn-1, separate the even-indexed and odd-indexed coefficients:
Then A(x) = Aeven(x²) + x · Aodd(x²). This is an identity — verify it by substituting back. For example, with A(x) = 1 + 2x + 3x² + 4x³:
The key insight: both Aeven and Aodd are polynomials of degree n/2 - 1, and they are both evaluated at x². We have split one degree-(n-1) evaluation problem into two degree-(n/2-1) evaluation problems.
Now here is where the roots of unity work their magic. We want to evaluate A at ωn0, ωn1, ..., ωnn-1. When we plug these in to A(x) = Aeven(x²) + x · Aodd(x²), we need to evaluate Aeven and Aodd at the squares of the n-th roots. By the halving lemma, squaring the n-th roots gives the (n/2)-th roots, each appearing twice:
So evaluating Aeven(x²) and Aodd(x²) at the n-th roots means evaluating Aeven and Aodd at the (n/2)-th roots — a DFT of half the size! This is the divide step: one size-n DFT becomes two size-(n/2) DFTs, plus O(n) work to combine them.
Why does the combine step have that plus/minus form? Let us derive it. For k in [0, n/2):
For k + n/2 (the second half of the output):
The key step: ωnn/2 = eπi = -1. The point halfway around the unit circle is exactly -1. So the second half of the output just negates the twiddle factor. One multiplication, two outputs. This is the butterfly.
Let us trace FFT([1, 2, 3, 4]) step by step to see the machinery in action.
This matches our hand computation from Chapter 3. The recursive FFT computed the same answer with only 4 complex multiplications (at the combine steps) instead of the 16 required by the naive DFT matrix multiply.
At each level of recursion, we do O(n) work (the n/2 butterfly operations). The recursion halves the problem each time, so there are log2(n) levels. Total: T(n) = 2T(n/2) + O(n) = O(n log n).
More precisely, at each level we perform n/2 complex multiplications (the twiddle factors) and n complex additions/subtractions (the butterfly sums). Over log2(n) levels, that is (n/2)·log2(n) multiplications and n·log2(n) additions. Since a complex multiplication takes 4 real multiplications and 2 real additions, the total real operation count is about 5n·log2(n).
| n | Naive O(n²) | FFT O(n log n) | Speedup |
|---|---|---|---|
| 24 = 16 | 256 | 64 | 4× |
| 28 = 256 | 65,536 | 2,048 | 32× |
| 210 = 1,024 | 1,048,576 | 10,240 | 102× |
| 216 = 65,536 | 4.3 × 109 | 1,048,576 | 4,096× |
| 220 = 1M | 1012 | 20,971,520 | 50,000× |
At n = 1 million, the FFT is fifty thousand times faster. This is why the FFT matters: it makes previously intractable problems routine.
python import cmath def fft(a): """Recursive FFT. Input length must be a power of 2.""" n = len(a) if n == 1: return a # base case: DFT of single element is itself # principal n-th root of unity omega_n = cmath.exp(2j * cmath.pi / n) omega = 1 # split into even and odd a_even = a[0::2] # indices 0, 2, 4, ... a_odd = a[1::2] # indices 1, 3, 5, ... # recurse y_even = fft(a_even) y_odd = fft(a_odd) # combine (butterfly operations) y = [0] * n for k in range(n // 2): t = omega * y_odd[k] # twiddle factor y[k] = y_even[k] + t # top wing y[k + n//2] = y_even[k] - t # bottom wing omega *= omega_n return y
Each line maps directly to the math. The variable omega accumulates ωnk as k increments. The variable t is the twiddle factor — the term ωnk · oddk that gets added to and subtracted from the even result. One complex multiplication produces two outputs. This is what makes the FFT efficient: each butterfly does 1 multiply for 2 outputs.
The recursive FFT is elegant but has overhead from function calls and array slicing. In practice, most implementations use an iterative version. The idea: process the butterfly stages bottom-up, from smallest to largest. The input must first be placed in bit-reversed order — index k goes to position reverse(k), where reverse flips the binary digits of k.
Why bit-reversal? The recursive FFT splits even/odd repeatedly. After log n splits, element at index k ends up at position bitReverse(k). The iterative version does this permutation upfront, then runs the butterfly stages in order.
python import cmath def fft_iterative(a, invert=False): """Iterative FFT using bit-reversal permutation.""" n = len(a) a = a[:] # copy log_n = n.bit_length() - 1 # bit-reversal permutation for i in range(n): rev = int(bin(i)[2:].zfill(log_n)[::-1], 2) if i < rev: a[i], a[rev] = a[rev], a[i] # butterfly stages length = 2 while length <= n: angle = 2 * cmath.pi / length * (-1 if invert else 1) omega_n = cmath.exp(1j * angle) for start in range(0, n, length): omega = 1 for j in range(length // 2): u = a[start + j] t = omega * a[start + j + length // 2] a[start + j] = u + t a[start + j + length // 2] = u - t omega *= omega_n length *= 2 if invert: a = [x / n for x in a] return a
The iterative version runs in-place with O(1) extra memory (beyond the input array). This matters in practice: modern CPUs are cache-bound, and in-place operations are significantly faster than allocating new arrays at each recursion level.
| Aspect | Recursive FFT | Iterative FFT |
|---|---|---|
| Code clarity | Very clean, maps directly to math | More complex, bit-reversal upfront |
| Memory | O(n log n) stack + new arrays | O(1) beyond input (in-place) |
| Cache behavior | Poor (scattered memory access) | Good (sequential access per stage) |
| Practical speed | Slower due to allocation overhead | Faster — used by all production libraries |
| When to use | Teaching, prototyping, interviews | Production code, competitive programming |
For interviews, the recursive version is usually fine — it is cleaner and easier to get right under pressure. For production, always use a library (NumPy, FFTW, cuFFT) that uses the iterative approach with platform-specific optimizations.
The iterative FFT starts by rearranging the input in bit-reversed order. Watch how index k (binary) maps to position reverse(k). This ensures the butterfly stages process elements in the correct pairs.
Watch the FFT process an 8-element input. Each stage halves the subproblem. Lines show how values combine: solid = add, dashed = subtract (with twiddle factor). Click Step to advance one stage.
We can now evaluate a polynomial at all n-th roots of unity in O(n log n) time — converting from coefficient form to point-value form. And we know that in point-value form, multiplication is just O(n) pointwise multiplication. But after multiplying, we have a product polynomial in point-value form. To be useful, we need to convert it back to coefficient form. We need the inverse DFT.
The beautiful surprise: the inverse DFT is almost the same as the forward DFT. This means we can reuse 100% of our FFT code for both directions.
Recall the DFT is y = Fn · a. The inverse is a = Fn-1 · y. What is Fn-1?
Here is the beautiful result: the inverse DFT matrix is almost the same as the forward DFT matrix. Specifically:
The only differences from the forward DFT: (1) replace ωn with ωn-1 (the conjugate root — go clockwise instead of counterclockwise), and (2) divide by n at the end.
Why does this work? Let us prove it. We need to show that Fn · Fn-1 = I, where Fn-1[j,k] = (1/n)ωn-jk.
The (i, j)-th entry of Fn · Fn-1 is:
By the summation lemma, this sum equals n when i = j (each term is ω0 = 1, summed n times), and 0 when i ≠ j (the n-th roots of unity at power i-j cancel). So (1/n) · n = 1 on the diagonal and (1/n) · 0 = 0 elsewhere. That is the identity matrix. QED.
One practical concern: the FFT uses floating-point arithmetic, and complex exponentials are irrational numbers. After a forward FFT and inverse FFT roundtrip, the recovered coefficients are approximately equal to the original — not exact. For integer coefficients, the error is typically less than 10-10, so rounding to the nearest integer gives exact results. But for very large n (millions), errors can accumulate.
This is one motivation for the Number Theoretic Transform (NTT), which works over finite fields and gives exact results. We will discuss the NTT in Chapter 9.
We now have all three pieces for O(n log n) polynomial multiplication:
python import cmath def ifft(y): """Inverse FFT. Same as FFT but with conjugate roots, then divide by n.""" n = len(y) if n == 1: return y omega_n = cmath.exp(-2j * cmath.pi / n) # NEGATIVE exponent omega = 1 y_even = y[0::2] y_odd = y[1::2] a_even = ifft(y_even) a_odd = ifft(y_odd) a = [0] * n for k in range(n // 2): t = omega * a_odd[k] a[k] = a_even[k] + t a[k + n//2] = a_even[k] - t omega *= omega_n return a def inverse_fft(y): """Full inverse: run ifft then divide by n.""" n = len(y) a = ifft(y) return [x / n for x in a]
Enter coefficients on the left. Forward FFT produces frequency-domain values (middle). Inverse FFT recovers the original coefficients (right). Verify they match!
This chapter is the payoff. Everything we have built — polynomial representations, roots of unity, the DFT, the FFT, the inverse FFT — comes together in one of the most powerful ideas in applied mathematics: convolution.
Polynomial multiplication has a name in signal processing: convolution. When you multiply A(x) · B(x) = C(x), the coefficient ck of the product is:
This is the convolution sum, often written c = a * b (the asterisk means convolution, not multiplication). It appears everywhere in science and engineering: when you blur an image (convolve the image with a kernel), when you filter audio (convolve the signal with a filter), when you compute the probability distribution of the sum of two random variables, when you cross-correlate two time series, when you apply a moving average. In every case, the naive computation is O(n²) and the FFT makes it O(n log n).
The connection to polynomial multiplication is immediate: the k-th coefficient of C = A · B is the convolution of the coefficient vectors of A and B. So every technique for fast polynomial multiplication automatically gives fast convolution.
Think of it this way. You have a signal (a sequence of numbers representing audio samples). You have a filter (another sequence — maybe a low-pass filter that removes high-frequency noise). To apply the filter, you convolve the signal with the filter. The naive approach slides the filter across the signal, computing a dot product at each position — O(n²). The FFT approach: transform both to frequency domain, multiply pointwise, transform back — O(n log n).
This is not magic — it follows directly from what we already know. The coefficient vector of the product polynomial A · B is the convolution of the coefficient vectors of A and B. And we showed that polynomial multiplication can be done by: FFT, pointwise multiply, IFFT. Combining these two facts:
That is the convolution theorem: convolution in the time domain equals pointwise multiplication in the frequency domain. The FFT is simply the efficient way to switch between domains.
python import numpy as np def convolve_fft(signal, kernel): """Convolve signal with kernel using FFT.""" n = len(signal) + len(kernel) - 1 N = 1 while N < n: N <<= 1 # FFT both, pointwise multiply, IFFT S = np.fft.fft(signal, N) K = np.fft.fft(kernel, N) return np.fft.ifft(S * K).real[:n] # Verify against numpy's direct convolution s = [1, 2, 3, 4, 5] k = [0.2, 0.6, 0.2] # smoothing kernel print(np.allclose(convolve_fft(s, k), np.convolve(s, k))) # True
Let us multiply A(x) = 1 + 2x + 3x² and B(x) = 2 + 1x using the full FFT pipeline.
Verify: (1 + 2x + 3x²)(2 + x) = 2 + x + 4x + 2x² + 6x² + 3x³ = 2 + 5x + 8x² + 3x³. Correct.
Here is a beautiful application that makes convolution tangible. Roll two fair dice. What is the probability of each possible sum (2 through 12)?
Each die has probabilities [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] for outcomes [1, 2, 3, 4, 5, 6]. The distribution of the sum of two independent random variables is the convolution of their individual distributions.
The convolution of two uniform distributions gives a triangular distribution. If you had 100 dice (combinatorics exam question!), the naive approach requires summing over all combinations — but the FFT computes the convolution of 100 distributions in O(n · 100 · log(n·100)) time by convolving them pairwise.
Watch the three-step process: forward FFT on both polynomials, pointwise multiply, inverse FFT. Adjust coefficients of A and B. The result C = A · B is shown below.
Now the real payoff. Instead of abstract polynomials, think of the coefficients as signal samples. A signal a = [a0, a1, ..., an-1] represents sound pressure at each time step. A kernel b = [b0, b1, ..., bm-1] represents a filter (like an averaging window for smoothing, or a difference operator for edge detection).
Convolution slides the kernel across the signal, computing a weighted sum at each position. The output ck = ∑ aj · bk-j is the filtered signal.
A smoothing kernel (box filter) is just [1/w, 1/w, ..., 1/w] — it averages w adjacent samples. This removes high-frequency noise but blurs sharp edges. A Gaussian kernel weights the center more heavily than the edges, giving a smoother blur.
An edge detection kernel like [-1, 0, 1] computes the discrete derivative — the difference between adjacent samples. Wherever the signal changes sharply, the output spikes. This is how Photoshop's "find edges" filter works.
A sharpening kernel emphasizes differences: it subtracts a blurred version of the signal from the original, amplifying edges. The kernel [0, -1, 5, -1, 0] (center-heavy with negative sides) achieves this.
An echo kernel like [0.7, 0, 0, ..., 0, 0.3] adds a delayed copy of the signal to itself. In audio, this literally creates an echo — you hear the original sound followed by a quieter repetition. Professional reverb effects use kernels with thousands of samples (the impulse response of a physical space, recorded by playing a loud click in a concert hall and recording the result). Convolving dry audio with a hall impulse response gives you audio that sounds like it was recorded in that hall. The FFT makes this real-time.
Top: input signal (orange). Middle: convolution kernel (teal). Bottom: convolved output (purple). Choose different kernel types and watch how the signal changes. All computed via FFT!
The FFT is not merely a theoretical curiosity — it is one of the most practically consequential algorithms in computing. Every application below would be intractable without it.
A microphone records sound as a sequence of pressure values over time — the time domain representation. But sound is really a sum of sinusoidal waves at different frequencies. The FFT converts from time domain to frequency domain, revealing which frequencies are present and how strong each one is. This is called spectral analysis.
Your phone does this in real time when it identifies a song (Shazam), processes your voice (speech recognition), or noise-cancels your AirPods. The FFT runs millions of times per second in these applications.
A concrete example: CD-quality audio is sampled at 44,100 Hz (44,100 samples per second). To analyze the frequency content of a 1-second clip, you compute the FFT of 44,100 samples. The output gives you 22,050 frequency bins (by conjugate symmetry), each representing a frequency from 0 Hz to 22,050 Hz. The magnitude of each bin tells you how loud that frequency is. A piano note at middle C (262 Hz) will show a peak at bin 262 and its harmonics (524 Hz, 786 Hz, etc.).
For real-time analysis, you use the Short-Time Fourier Transform (STFT): take overlapping windows of ~1,024 samples, FFT each window, and stack the results. The output is a spectrogram — a 2D plot with time on one axis, frequency on the other, and color representing magnitude. Spectrograms are the standard visualization for audio in music production, speech analysis, and machine learning.
A 2D image is a grid of pixel values. Blurring, sharpening, and edge detection are all 2D convolutions — sliding a small kernel (like a 3×3 Gaussian) across the image. For a 4K image (8 million pixels) and a large kernel, naive convolution is slow. The 2D FFT (applying the 1D FFT to rows, then columns) makes it O(N log N) where N is the total number of pixels.
Every time you apply a filter in Photoshop or Instagram, the 2D FFT is running under the hood. MRI scanners also use the 2D (and 3D) FFT: the scanner collects data in "k-space" (the frequency domain), and the inverse FFT converts it to the spatial image your doctor sees. A single MRI scan requires millions of FFT operations.
Multiplying two n-digit integers is exactly polynomial multiplication — treat each digit as a coefficient. The number 1234 = 4 + 3·10 + 2·100 + 1·1000 = polynomial A(x) = 4 + 3x + 2x² + x³ evaluated at x = 10.
So to multiply two n-digit numbers: treat them as polynomials, multiply using FFT, then propagate carries (because digit values can exceed 9 after multiplication). Grade-school multiplication is O(n²). The FFT-based Schönhage-Strassen algorithm (1971) multiplies n-digit integers in O(n log n log log n). This is what Python uses internally for very large integers (thousands of digits). In 2019, Harvey and van der Hoeven achieved the theoretically optimal O(n log n), matching the FFT bound exactly.
Surprise: you can find all occurrences of a pattern in a text using convolution. Encode the text and pattern as polynomials (character values as coefficients), convolve, and read off matches from the output. The trick: reverse the pattern before convolving, so that the convolution computes the dot product of the pattern with each window of the text.
This is especially useful for wildcard matching where the pattern contains "don't care" characters — a case where KMP and Boyer-Moore struggle. With wildcards, define a match score polynomial: for each text position, the convolution output gives the number of matching characters. Positions where the score equals the pattern length (minus wildcards) are matches. The FFT computes all scores simultaneously in O(n log n).
MP3 compression works by transforming audio to the frequency domain using a cousin of the FFT called the Modified Discrete Cosine Transform (MDCT). In the frequency domain, the encoder identifies which frequencies are inaudible to human ears (psychoacoustic model) and removes them. The result: 10× compression with minimal perceived quality loss.
Many differential equations (heat equation, wave equation, Schrödinger equation) become simple algebraic equations in the frequency domain. Transform, solve, transform back. This is called the spectral method and it powers weather prediction models, fluid dynamics simulations, and quantum chemistry calculations.
In 2021, Google's FNet paper showed that replacing the attention mechanism in Transformers with a simple FFT achieves 92% of BERT's accuracy while being 7× faster on GPUs. The intuition: attention mixes information across token positions, and the FFT does the same thing (it mixes all input positions via the butterfly network). The FFT's fixed mixing pattern is less flexible than learned attention weights, but its O(n log n) complexity beats attention's O(n²).
The FFT also appears in spectral graph neural networks, where convolution on graphs is defined via the graph Fourier transform (eigendecomposition of the graph Laplacian).
Post-quantum cryptography relies heavily on polynomial arithmetic over finite fields. The NIST-standardized algorithms CRYSTALS-Kyber (key encapsulation) and CRYSTALS-Dilithium (digital signatures) both use the Number Theoretic Transform (NTT) — the FFT over Z/qZ — as their core computational primitive. Every time you send an encrypted message using post-quantum crypto, the NTT is running.
The security of these systems relies on the hardness of the Ring Learning With Errors (RLWE) problem, which involves polynomial arithmetic in Zq[x]/(xn+1). Adding and multiplying polynomials in this ring requires modular reduction, and the NTT makes the multiplication step efficient. Without the FFT/NTT, post-quantum cryptography would be too slow for practical use.
| Application | What's Convolved | Naive Cost | FFT Cost |
|---|---|---|---|
| Audio filtering | Signal × filter impulse response | O(n²) | O(n log n) |
| Image blur | Image × Gaussian kernel | O(N·k²) | O(N log N) |
| Big integer × | Digit vectors | O(n²) | O(n log n) |
| Polynomial × | Coefficient vectors | O(n²) | O(n log n) |
| String matching | Text × reversed pattern | O(nm) | O(n log n) |
| MRI reconstruction | k-space data → image | O(N²) | O(N log N) |
Compose a signal from up to 4 sine waves of different frequencies. The FFT decomposes it back into its constituent frequencies. Adjust frequency and amplitude sliders to see the spectrum change.
A few things that textbooks often omit but matter in real code:
The FFT has overhead: you need to (1) zero-pad both arrays, (2) compute two forward FFTs, (3) compute pointwise products, and (4) compute one inverse FFT. For small inputs, this overhead is larger than the naive O(nk) cost.
The crossover point depends on the implementation and hardware, but as a rule of thumb: if the kernel has fewer than about 64 elements, direct convolution is faster. If the kernel has more than about 64 elements, FFT convolution wins. For very large kernels (thousands of elements), the FFT is dramatically faster.
In machine learning, most convolution layers use small kernels (3×3 or 5×5), so they use direct convolution (or the im2col + GEMM approach). But in audio processing, reverb kernels can have 100,000+ samples, and the FFT is essential.
Libraries like SciPy automatically choose between direct and FFT convolution based on input size — the function scipy.signal.fftconvolve always uses FFT, while scipy.signal.convolve picks the faster method.
| Concept | Key Fact |
|---|---|
| DFT | Evaluate polynomial at n-th roots of unity: yk = ∑ ajωjk |
| FFT | Compute DFT in O(n log n) via divide-and-conquer on even/odd coefficients |
| Inverse FFT | Same algorithm with ω-1 instead of ω, divide by n |
| Butterfly | yk = ek + ωkok, yk+n/2 = ek - ωkok |
| Convolution theorem | Time-domain convolution = frequency-domain pointwise multiply |
| Poly multiplication | FFT both → pointwise multiply → IFFT. O(n log n) total |
| Input requirement | Length must be power of 2 (pad with zeros) |
| Roots of unity | ωnk = e2πik/n. Halving: (ωnk)² = ωn/2k |
python import cmath def fft(a, invert=False): """Unified FFT/IFFT. Set invert=True for inverse.""" n = len(a) if n == 1: return a[:] # direction: +1 for forward, -1 for inverse angle = 2 * cmath.pi / n * (-1 if invert else 1) omega_n = cmath.exp(1j * angle) omega = 1 a_even = fft(a[0::2], invert) a_odd = fft(a[1::2], invert) result = [0] * n for k in range(n // 2): t = omega * a_odd[k] result[k] = a_even[k] + t result[k + n//2] = a_even[k] - t omega *= omega_n if invert: result = [x / 2 for x in result] # divide by 2 at each level = /n total return result def poly_multiply(a, b): """Multiply polynomials a and b using FFT. Returns coefficient list.""" # find result size and pad to power of 2 result_len = len(a) + len(b) - 1 n = 1 while n < result_len: n <<= 1 # pad with zeros a = a + [0] * (n - len(a)) b = b + [0] * (n - len(b)) # forward FFT fa = fft(a) fb = fft(b) # pointwise multiply fc = [fa[i] * fb[i] for i in range(n)] # inverse FFT c = fft(fc, invert=True) # round to integers if coefficients were integers return [round(x.real) for x in c[:result_len]] # Test: (1 + 2x + 3x²) * (2 + x) = 2 + 5x + 8x² + 3x³ print(poly_multiply([1, 2, 3], [2, 1])) # [2, 5, 8, 3]
The FFT rarely appears by name in coding interviews. Instead, the interviewer asks a problem that is secretly convolution:
| Problem Pattern | Why It's Convolution |
|---|---|
| "Multiply two big numbers represented as arrays" | Digits are polynomial coefficients. Multiply polynomials, carry. |
| "Count pairs with difference k" or "histogram of sums" | Frequency count is a polynomial. Sum of two dice = convolution of two distributions. |
| "Find pattern with wildcards in text" | Encode as polynomials. Convolution gives match scores at each position. |
| "Multiply two polynomials mod x^n" | Direct FFT application. Common in competitive programming. |
| "Apply a 1D filter to a signal" | Filter = convolution kernel. FFT for large kernels. |
Treat digits as polynomial coefficients. Multiply using FFT. Then propagate carries.
python def big_multiply(num1_str, num2_str): """Multiply two large integers represented as strings.""" a = [int(d) for d in reversed(num1_str)] # least significant first b = [int(d) for d in reversed(num2_str)] # polynomial multiplication via FFT c = poly_multiply(a, b) # propagate carries carry = 0 result = [] for val in c: total = val + carry result.append(total % 10) carry = total // 10 while carry: result.append(carry % 10) carry //= 10 return ''.join(str(d) for d in reversed(result)).lstrip('0') or '0' # Test: 123 * 456 = 56088 print(big_multiply("123", "456")) # "56088"
In practice, you will use a library. NumPy's FFT is highly optimized (FFTPACK under the hood).
python import numpy as np def poly_multiply_numpy(a, b): """Polynomial multiplication using NumPy's FFT.""" result_len = len(a) + len(b) - 1 n = 1 while n < result_len: n <<= 1 fa = np.fft.fft(a, n) fb = np.fft.fft(b, n) fc = fa * fb c = np.fft.ifft(fc).real return np.round(c[:result_len]).astype(int) # 5 lines. That is it. The FFT does the heavy lifting. print(poly_multiply_numpy([1,2,3], [2,1])) # [2, 5, 8, 3]
| Mistake | Why It Fails | Fix |
|---|---|---|
| Forgetting to pad to power of 2 | FFT requires n = 2k | Pad with zeros: while n < len: n *= 2 |
| Not padding enough for convolution | Get circular convolution artifacts | Pad to len(a) + len(b) - 1, rounded up |
| Using forward FFT for inverse | Wrong roots, no 1/n scaling | Negate exponent, divide by n |
| Off-by-one in output length | Product of deg-n and deg-m has deg n+m | Result has n+m+1 coefficients |
| Not rounding after IFFT | Floating-point noise gives 2.999999 | Round to int if coefficients are integers |
As n grows, the FFT's advantage becomes staggering. Drag the slider to see the operation count for naive vs FFT polynomial multiplication.
For FFT-related interview questions, make sure you can address these five aspects:
| Dimension | What to Demonstrate |
|---|---|
| CONCEPT | Explain the three-step strategy (evaluate, multiply, interpolate). Why roots of unity? Why O(n log n)? |
| DESIGN | When to use FFT vs naive convolution (crossover around n ≈ 64). Zero-padding for linear convolution. Power-of-2 sizing. |
| CODE | Implement FFT from scratch (recursive or iterative). Use library FFT for polynomial multiply. NTT for modular arithmetic. |
| DEBUG | Common bugs: circular vs linear convolution, off-by-one in output length, forgetting to divide by n in IFFT. |
| FRONTIER | NTT for post-quantum crypto. FNet (FFT replacing attention). Sparse FFT for near-sparse signals. Sub-linear FFT. |
The FFT connects to an extraordinary range of topics in computer science and mathematics. Here is how it fits into the CLRS universe and beyond.
| Chapter | Connection |
|---|---|
| Ch 4: Divide & Conquer | The FFT is a D&C algorithm: T(n) = 2T(n/2) + O(n) = O(n log n). The master theorem (Case 2) gives the complexity directly. |
| Ch 31: Number Theory | The FFT works over any ring with principal roots of unity — including Z/pZ for prime p. This gives the Number Theoretic Transform (NTT), which avoids floating-point errors entirely. |
| Ch 32: String Matching | Wildcard pattern matching via convolution: encode pattern and text as polynomials, convolve, and read off match positions from the output. |
| Topic | How FFT Appears |
|---|---|
| Signal Processing | The foundation of DSP. STFT, spectrogram, filter design, noise cancellation. |
| Machine Learning | Efficient attention mechanisms (FNet replaces attention with FFT). Spectral graph convolutions. |
| Cryptography | NTT for polynomial operations in lattice-based post-quantum crypto (CRYSTALS-Kyber, Dilithium). |
| Competitive Programming | Any problem reducible to polynomial multiplication or convolution. ~15% of advanced CP problems use FFT. |
The FFT uses complex roots of unity, which require floating-point arithmetic. For exact computation (big integers, competitive programming), the NTT works in modular arithmetic. Choose a prime p such that p - 1 is divisible by a large power of 2 (popular choice: p = 998244353, where p - 1 = 223 · 119). The primitive root modulo p plays the role of ω. The algorithm is identical — just replace complex exponentiation with modular exponentiation.
Why does this work? Because a finite field Z/pZ (integers mod prime p) also has roots of unity. If g is a primitive root mod p (meaning g generates all nonzero elements), then ω = g(p-1)/n is a principal n-th root of unity mod p. It satisfies all three properties: cancellation, halving, and summation — all in modular arithmetic.
python # Number Theoretic Transform (NTT) # Exact polynomial multiplication over integers mod p MOD = 998244353 # NTT-friendly prime: 2^23 * 119 + 1 PRIM_ROOT = 3 # primitive root of MOD def power_mod(base, exp, mod): """Fast modular exponentiation.""" result = 1 base %= mod while exp > 0: if exp & 1: result = result * base % mod exp >>= 1 base = base * base % mod return result def ntt(a, invert=False): """In-place NTT. Works exactly like iterative FFT but mod p.""" n = len(a) j = 0 for i in range(1, n): bit = n >> 1 while j & bit: j ^= bit; bit >>= 1 j ^= bit if i < j: a[i], a[j] = a[j], a[i] length = 2 while length <= n: w = power_mod(PRIM_ROOT, (MOD - 1) // length, MOD) if invert: w = power_mod(w, MOD - 2, MOD) # modular inverse for start in range(0, n, length): wn = 1 for k in range(length // 2): u = a[start + k] v = a[start + k + length // 2] * wn % MOD a[start + k] = (u + v) % MOD a[start + k + length // 2] = (u - v) % MOD wn = wn * w % MOD length <<= 1 if invert: inv_n = power_mod(n, MOD - 2, MOD) for i in range(n): a[i] = a[i] * inv_n % MOD def poly_multiply_ntt(a, b): """Exact polynomial multiplication using NTT.""" result_len = len(a) + len(b) - 1 n = 1 while n < result_len: n <<= 1 fa = a + [0] * (n - len(a)) fb = b + [0] * (n - len(b)) ntt(fa); ntt(fb) fc = [fa[i] * fb[i] % MOD for i in range(n)] ntt(fc, invert=True) return fc[:result_len] # Test: (1 + 2x + 3x²)(2 + x) = 2 + 5x + 8x² + 3x³ print(poly_multiply_ntt([1,2,3], [2,1])) # [2, 5, 8, 3]
This NTT implementation is the standard template for competitive programming. It handles polynomials with coefficients up to ~109 and lengths up to 223 ≈ 8 million. The mod 998244353 is chosen specifically because its totient has a large power-of-2 factor, enabling the full radix-2 FFT decomposition.
| Situation | Use | Why |
|---|---|---|
| Coefficients are real/complex floats | Complex FFT | Natural domain, simpler code |
| Exact integer results needed | NTT or complex FFT + rounding | Complex FFT works for small ints (<1015) |
| Result needed mod a prime | NTT | Exact, no floating-point |
| Competitive programming | NTT (mod 998244353) | Exact, fast, standard template |
| Signal/image processing | Complex FFT | Data is naturally floating-point |
| Post-quantum cryptography | NTT | Security requires exact modular arithmetic |
| Variant | What Changes | When to Use |
|---|---|---|
| Cooley-Tukey FFT | Standard radix-2 D&C (this lesson) | General purpose, n = power of 2 |
| Mixed-Radix FFT | Splits into radix-3, radix-5, etc. | n not a power of 2 (e.g., n = 12 = 4×3) |
| Bluestein's FFT | Reduces arbitrary-n DFT to convolution | n is prime (no small factors) |
| NTT | Roots of unity in Z/pZ, not C | Exact arithmetic, competitive programming, crypto |
| 2D FFT | Apply 1D FFT to rows, then columns | Image processing, 2D convolution |
| STFT | FFT on overlapping windows | Time-frequency analysis (spectrograms) |
| DCT | Real-valued cousin, cosine basis | JPEG, MP3, video compression |
The Cooley-Tukey paper (1965) is often cited as the discovery of the FFT. But Gauss described essentially the same algorithm in 1805 — 160 years earlier — while computing the orbits of asteroids. He never published it as a standalone result. The lesson: important algorithms are often discovered, forgotten, and rediscovered.
The impact of Cooley-Tukey was immediate and enormous. Before 1965, computing a DFT of size 1024 required ~1 million operations. After, it required ~10,000. Entire fields — digital signal processing, computational physics, medical imaging — became practical overnight. The algorithm has been called "the most important numerical algorithm of our lifetime" by Gilbert Strang and "one of the great computational tools of our century" by the American Mathematical Society.
Let us step back and admire the full construction. We started with a simple problem — multiply two polynomials — and arrived at one of the deepest algorithms in computer science.
Every step builds on the previous one. Remove any link in the chain and the algorithm does not work. This is what makes the FFT one of the most elegant constructions in all of algorithm design.
The FFT is not just an algorithm — it is a way of thinking. The idea that changing representation can make hard problems easy, that structure in evaluation points enables divide and conquer, that the same algorithm works forward and backward — these ideas appear again and again across mathematics, physics, and computer science. Mastering the FFT gives you a template for recognizing and exploiting structure in computation.
The full polynomial multiplication pipeline visualized. Input coefficients → FFT → pointwise multiply → IFFT → output coefficients. Each box shows the data at that stage.