Polynomials & the FFT — CLRS Chapter 30

Chapter 0: The Problem

You have two polynomials. One represents the frequency response of a microphone. The other represents the frequency response of a room. To figure out what a listener actually hears, you need to multiply these polynomials together. The result is a new polynomial whose degree is the sum of the two input degrees.

Here is the trouble. If each polynomial has n terms, multiplying them the way you learned in algebra class — distributing every term against every other term — takes O(n²) operations. For a polynomial with a million terms (common in signal processing), that is a trillion multiplications. Your computer will take minutes. Hours if n grows further.

In 1965, James Cooley and John Tukey published a 5-page paper that changed the world. They showed how to multiply polynomials in O(n log n) time — an algorithm called the Fast Fourier Transform (FFT). For n = 1,000,000, that is 20 million operations instead of a trillion. A speedup of 50,000×. On a modern laptop that is the difference between a millisecond and several minutes.

The FFT does not just speed up polynomial multiplication. It transforms the problem into a domain where multiplication is trivial, then transforms back. This idea — change representation, do easy work, change back — is one of the most powerful strategies in all of algorithm design. It is the same principle behind taking logarithms to turn multiplication into addition, or using eigendecomposition to simplify matrix operations.

Why this matters. The FFT is routinely listed among the ten most important algorithms of the 20th century. It powers cell phone signals (OFDM), image compression (JPEG), audio compression (MP3), medical imaging (MRI reconstruction), weather prediction, financial modeling, and every time you use Shazam to identify a song. Without the FFT, modern digital civilization would not exist.

Let us see the problem concretely. Suppose A(x) = 3 + 2x + x² and B(x) = 1 + 4x. To multiply them:

A(x) · B(x) = (3 + 2x + x²)(1 + 4x)
  = 3·1 + 3·4x + 2x·1 + 2x·4x + x²·1 + x²·4x
  = 3 + 12x + 2x + 8x² + x² + 4x³
  = 3 + 14x + 9x² + 4x³

Every term in A multiplied every term in B — that is 3 × 2 = 6 multiplications. For degree-n polynomials, the product has degree 2n, and computing each of the 2n+1 output coefficients requires summing up to n+1 products. Total work: O(n²).

To see why it is exactly O(n²), think about it as a grid. Write A's coefficients along one axis and B's along the other. Each cell in the grid is one multiplication. The product polynomial's k-th coefficient c_k is the sum of all cells along the k-th anti-diagonal:

c_k = ∑_j=0^k a_j · b_k-j

This sum is called a convolution — a word we will return to in Chapter 6. For now, the point is that computing all c_k values requires visiting every cell in the grid: (n+1)×(m+1) cells for polynomials of degree n and m.

How bad is O(n²)? For a degree-100 polynomial (modest by real-world standards), that is 10,000 multiplications. For degree-10,000 (common in signal processing), 100 million. For degree-1,000,000 (not unusual in scientific computing), one trillion. Even at a billion operations per second, that is 1,000 seconds — over 16 minutes for a single polynomial multiplication.

The FFT reduces this to about 20 million operations for degree-1,000,000. At a billion operations per second, that is 0.02 seconds. From 16 minutes to 20 milliseconds. This is not a marginal improvement — it is the difference between "impossible in practice" and "instant."

The simulation below shows this process. Watch how the number of multiplications (the orange count) grows quadratically as you increase the polynomial degree.

Naive Polynomial Multiplication: O(n²)

Each cell in the grid represents one single-term multiplication. The total number of cells is (deg A + 1) × (deg B + 1). Watch it explode as degree grows.

Degree n 4

The FFT's genius is a three-step strategy: (1) convert the polynomials to a different representation where multiplication is trivial, (2) multiply, (3) convert back. The conversion steps each take O(n log n), and the multiplication in the alternate representation takes O(n). Total: O(n log n).

That alternate representation is called the point-value form, and the conversion is the Discrete Fourier Transform. Over the next nine chapters, we will build the entire pipeline from scratch.

What you will learn. By the end of this lesson you will be able to: (1) convert between coefficient and point-value representations, (2) explain why complex roots of unity enable O(n log n) evaluation, (3) trace through the FFT butterfly diagram by hand, (4) implement FFT-based polynomial multiplication in Python, and (5) recognize convolution problems in interviews and apply the FFT to solve them.

Concept check: Multiplying two degree-n polynomials the naive way (distributing every term) requires how many single-term multiplications?

O(n) — you just multiply corresponding coefficients O(n log n) — divide and conquer O(n²) — every term in A times every term in B O(n³) — it is like matrix multiplication

Chapter 1: Polynomial Representations

A polynomial of degree n is usually written as a list of coefficients. A(x) = a₀ + a₁x + a₂x² + ... + a_nxⁿ. We store it as a vector a = (a₀, a₁, ..., a_n). This is the coefficient representation. It has n+1 numbers and uniquely determines the polynomial.

For example, the polynomial 3 + 2x + x² is stored as [3, 2, 1]. The polynomial 5 - x³ is stored as [5, 0, 0, -1]. The index of each element is the power of x it multiplies. This is the representation you learned in algebra class, and it is by far the most common way to work with polynomials.

But there is another way. A polynomial of degree n is also uniquely determined by its values at n+1 distinct points. If you know that A(0) = 3, A(1) = 6, and A(2) = 11, that is enough to recover A(x) = 3 + 2x + x². Let us verify: A(0) = 3 + 0 + 0 = 3. A(1) = 3 + 2 + 1 = 6. A(2) = 3 + 4 + 4 = 11. Check.

This is the point-value representation: a set of pairs {(x₀, y₀), (x₁, y₁), ..., (x_n, y_n)} where y_k = A(x_k). Why does this work? Because n+1 distinct points uniquely determine a degree-n polynomial. This is the polynomial interpolation theorem: given any n+1 points with distinct x-values, there is exactly one polynomial of degree ≤ n that passes through all of them. The proof uses the Vandermonde matrix, which is nonsingular when all x_k are distinct.

The recovery process — going from point-value form back to coefficient form — is called interpolation. Lagrange interpolation does this in O(n²) for arbitrary points. But for the special points we will use (roots of unity), the inverse FFT does it in O(n log n).

Why two representations? Because multiplication behaves differently in each one. In coefficient form, multiplying two polynomials requires O(n²) work (convolving the coefficient vectors). In point-value form, multiplication is pointwise — just multiply the y-values at each point. That is O(n) work. The whole FFT strategy exploits this asymmetry.

Let us be precise. If A is represented by {(x₀, y₀), ..., (x_n, y_n)} and B by {(x₀, y'₀), ..., (x_n, y'_n)} — evaluated at the same points x₀, ..., x_n — then C = A · B is simply {(x₀, y₀·y'₀), ..., (x_n, y_n·y'_n)}.

One subtlety: if A has degree n_a and B has degree n_b, the product C has degree n_a + n_b. So we need at least n_a + n_b + 1 points to represent C. We evaluate both A and B at that many points before multiplying.

Why Does Point-Value Multiplication Work?

This feels almost too good to be true. Let us verify it. If A(x₀) = 3 and B(x₀) = 5, is it really true that (A · B)(x₀) = 15? Yes! Polynomial multiplication produces a new polynomial C such that C(x) = A(x) · B(x) for all x. So evaluating C at x₀ is literally A(x₀) · B(x₀). Point-value multiplication just uses this identity at each evaluation point.

The tradeoff is clear. In coefficient form: addition is O(n) (add corresponding coefficients), but multiplication is O(n²). In point-value form: multiplication is O(n) (multiply corresponding values), but converting between representations is the bottleneck.

Operation	Coefficient Form	Point-Value Form
Addition	O(n)	O(n)
Multiplication	O(n²)	O(n)
Evaluation at one point	O(n) via Horner	O(n²) via interpolation

The three-step strategy for fast polynomial multiplication is now clear:

1. Evaluate

Convert A and B from coefficient form to point-value form by evaluating at 2n points. This is called the DFT (Discrete Fourier Transform).

↓

2. Pointwise Multiply

Multiply the y-values pairwise: c_k = a_k · b_k. This takes O(n) time.

↓

3. Interpolate

Convert the product back from point-value form to coefficient form. This is the inverse DFT.

The entire game is making steps 1 and 3 fast. Evaluating a polynomial at a single point takes O(n) by Horner's rule. Evaluating at 2n points naively takes O(n²) — no better than direct multiplication. The FFT evaluates at 2n cleverly chosen points in O(n log n).

What are those clever points? The complex roots of unity. That is the topic of the next chapter.

Horner's Rule: Evaluating at a Single Point

Before moving on, let us pin down exactly how evaluation works. Given A(x) = a₀ + a₁x + a₂x² + a₃x³, the naive approach computes each power of x separately — that is wasteful. Horner's rule rewrites the polynomial as:

A(x) = a₀ + x(a₁ + x(a₂ + x · a₃))

Start from the innermost parenthesis and work outward: multiply by x, add the next coefficient, repeat. This uses exactly n multiplications and n additions — O(n) total. Horner's rule is optimal for single-point evaluation, but for multi-point evaluation (many different x values), we need a fundamentally different approach.

python
def horner(coeffs, x):
    """Evaluate polynomial at x using Horner's method. O(n)."""
    result = 0
    for a in reversed(coeffs):
        result = result * x + a
    return result

# Example: A(x) = 3 + 2x + x² evaluated at x=5
print(horner([3, 2, 1], 5))  # 3 + 10 + 25 = 38

Two Representations of the Same Polynomial

The same polynomial shown as coefficients (left) and as point-values (right). Drag the coefficient sliders to change the polynomial and watch the point-value pairs update instantly.

a₀ 1.0

a₁ 2.0

a₂ -0.5

Concept check: In point-value form, multiplying two polynomials (evaluated at the same set of points) takes how much time?

O(n) — just multiply corresponding y-values O(n log n) — need to sort the points first O(n²) — same as coefficient form O(1) — it is a single operation

Chapter 2: Roots of Unity

We need to evaluate polynomials at carefully chosen points so that the evaluation step can be done in O(n log n) instead of O(n²). The magic points are the complex roots of unity.

Why not just use simple points like 0, 1, 2, ..., n-1? Because those points have no special algebraic structure that we can exploit for speedup. The roots of unity have a very specific structure — they are symmetric, they recurse into smaller sets of roots, and they satisfy orthogonality conditions. All of this structure is what the FFT exploits.

If you have never worked with complex numbers, here is the one-minute version: a complex number z = a + bi has a real part a and an imaginary part b, where i = √(-1). You can plot z as a point (a, b) in the plane. The magnitude |z| = √(a²+b²) is the distance from the origin. Multiplication by i rotates a point 90° counterclockwise. Multiplication by e^iθ rotates by angle θ.

The n-th roots of unity are the n complex numbers that satisfy the equation zⁿ = 1. They are evenly spaced around the unit circle in the complex plane:

ω_n^k = e^2πik/n = cos(2πk/n) + i·sin(2πk/n), k = 0, 1, ..., n-1

Here ω_n = e^2πi/n is the principal n-th root of unity — the first one counterclockwise from 1 on the unit circle. All other n-th roots are powers of ω_n.

For example, the 4th roots of unity are ω₄⁰ = 1, ω₄¹ = i, ω₄² = -1, ω₄³ = -i. They sit at the four compass points of the unit circle: east, north, west, south. The 8th roots add four more points in between (northeast, northwest, southwest, southeast). The 16th roots add eight more, and so on — each doubling fills in the gaps.

Notice that 1 is always a root of unity (for any n), and so is -1 (when n is even). The roots always come in conjugate pairs: if a + bi is a root, so is a - bi. This is why real-valued inputs produce conjugate-symmetric DFT outputs.

The Three Key Properties

Roots of unity have three properties that make the FFT possible. Without any one of them, the algorithm breaks.

Property 1: Cancellation Lemma. For any n ≥ 0, any nonzero k, and any d ≥ 1: ω_dn^dk = ω_n^k. In words: taking every d-th root of unity from dn roots gives you the n roots. This means the (n/2)-th roots of unity are a subset of the n-th roots — which is exactly what lets us recurse.

Property 2: Halving Lemma. If n is even, the squares of the n-th roots of unity are the (n/2)-th roots of unity. Formally: (ω_n^k)² = ω_n/2^k. This is a corollary of cancellation. It means squaring the 8th roots gives us the 4th roots. Squaring halves the problem.

Property 3: Summation Lemma. For any n ≥ 1 and any k not divisible by n: ∑_j=0^n-1 (ω_n^k)^j = 0. The n-th roots of unity sum to zero. This is the orthogonality property that makes the inverse DFT work.

The halving lemma is the engine of the FFT. When we split a polynomial into its even- and odd-indexed coefficients, the evaluation points square — and by the halving lemma, n evaluation points become n/2 evaluation points. That is the divide step that gives us O(n log n).

Hand Computation: 8th Roots of Unity

Let us compute the 8th roots concretely. ω₈ = e^2πi/8 = e^πi/4 = cos(45°) + i·sin(45°) = (√2/2) + i(√2/2) ≈ 0.707 + 0.707i.

ω₈⁰ = 1                // 0°
ω₈¹ ≈ 0.707 + 0.707i // 45°
ω₈² = i                // 90°
ω₈³ ≈ -0.707 + 0.707i // 135°
ω₈⁴ = -1               // 180°
ω₈⁵ ≈ -0.707 - 0.707i // 225°
ω₈⁶ = -i               // 270°
ω₈⁷ ≈ 0.707 - 0.707i // 315°

Now verify the halving lemma. Square each 8th root: (ω₈⁰)² = 1, (ω₈¹)² = i, (ω₈²)² = -1, (ω₈³)² = -i. Those are exactly the 4th roots! And the pattern repeats: (ω₈⁴)² = 1, (ω₈⁵)² = i, etc. Eight points collapsed to four. Squaring those four gives two points (the 2nd roots: 1 and -1). One more squaring gives one point (the 1st root: 1). Three halvings from 8 to 1 — that is log₂(8) = 3 recursion levels.

Verify: summation lemma. Sum all 4th roots: 1 + i + (-1) + (-i) = 0. Sum all 8th roots: the four extra roots also cancel in pairs (0.707+0.707i cancels with -0.707-0.707i, etc.). For any n ≥ 2, the n-th roots of unity sum to exactly zero.

Roots of Unity on the Unit Circle

The n-th roots of unity are evenly spaced on the unit circle. Use the slider to change n. Notice how n=8 roots include all n=4 roots (halving lemma). Click a root to see its value.

n 8

Concept check: The halving lemma says that squaring the n-th roots of unity gives you the (n/2)-th roots. Why is this critical for the FFT?

It makes the roots real-valued, simplifying computation It lets us recurse: n evaluation points become n/2, giving T(n) = 2T(n/2) + O(n) It cancels out the imaginary parts It makes all roots equal to 1

Chapter 3: The DFT

Now we can define the evaluation step precisely. The Discrete Fourier Transform (DFT) of a coefficient vector a = (a₀, a₁, ..., a_n-1) is a new vector y = (y₀, y₁, ..., y_n-1) where:

y_k = ∑_j=0^n-1 a_j · ω_n^jk, k = 0, 1, ..., n-1

In words: y_k is the polynomial A evaluated at ω_n^k. The DFT evaluates a polynomial at all n-th roots of unity simultaneously.

This is a matrix-vector product. We can write y = F_n · a where F_n is the DFT matrix:

F_n[j, k] = ω_n^jk

For n = 4, with ω = ω₄ = e^2πi/4 = i, the DFT matrix has a beautiful structure. Every entry is a power of i, and since i⁴ = 1, there are only four possible values: 1, i, -1, -i. The matrix looks like:

F₄ = [ 1 1    1    1 ]
      [ 1 ω ω² ω³]
      [ 1 ω² ω⁴ ω⁶]
      [ 1 ω³ ω⁶ ω⁹]

    = [ 1    1 1 1 ]
      [ 1    i -1 -i ]
      [ 1 -1 1 -1 ]
      [ 1 -i -1 i ]

The first row is all 1s (evaluating at ω⁰ = 1 just sums the coefficients). The first column is all 1s (every root raised to the 0th power is 1). Each subsequent row rotates faster around the unit circle.

Look at the structure more carefully. Row 0 has all entries equal to 1 — evaluating at x=1 just sums all coefficients. Row 1 cycles through all n-th roots once. Row 2 cycles through them twice (hitting every other root). Row k cycles through k times. The DFT matrix is a frequency analysis machine: each row "tests" a different frequency by correlating the input signal with a sinusoid of that frequency.

This is exactly what a tuning fork does. A tuning fork resonates (vibrates strongly) when it encounters its natural frequency in a sound wave, and stays silent otherwise. Each row of the DFT matrix is a mathematical tuning fork for a different frequency. The output value y_k tells you how much frequency k is present in the input signal.

Computing y = F_n · a naively is an n × n matrix-vector multiply: O(n²). That is no faster than distributing terms. The FFT computes this same product in O(n log n) by exploiting the structure of F_n — specifically, the halving lemma.

Think of the DFT matrix as a change of basis. The coefficient vector a lives in the "time domain" — it describes the polynomial by its coefficients. The DFT output y lives in the "frequency domain" — it describes the same polynomial by its values at the roots of unity. The matrix F_n is the change-of-basis matrix between these two representations.

A key property: F_n is a unitary matrix (up to scaling). Its columns are orthogonal, and each column has the same norm. This means the DFT preserves energy: the sum of |a_k|² equals (1/n) times the sum of |y_k|². This is Parseval's theorem, and it is crucial in signal processing — the energy of a signal in the time domain equals its energy in the frequency domain.

Key insight. The DFT matrix is not an arbitrary matrix. Its entries are powers of a single root of unity, and the halving lemma means the matrix has recursive structure. The FFT is just a fast algorithm for multiplying by this very specific matrix. It does not work for arbitrary matrices.

The DFT Matrix Visualized

Each cell shows ω_n^jk as a point on the unit circle. Row j, column k. The color encodes the angle: warm = positive real, teal = positive imaginary, purple = negative real, pink = negative imaginary. Toggle n to see how the matrix grows.

n 4

A Hand Computation

Let us compute the DFT of a = (1, 2, 3, 4) with n = 4. We need ω₄ = e^2πi/4 = i.

y₀ = 1·1 + 2·1 + 3·1 + 4·1 = 10
y₁ = 1·1 + 2·i + 3·(-1) + 4·(-i) = 1 + 2i - 3 - 4i = -2 - 2i
y₂ = 1·1 + 2·(-1) + 3·1 + 4·(-1) = 1 - 2 + 3 - 4 = -2
y₃ = 1·1 + 2·(-i) + 3·(-1) + 4·i = 1 - 2i - 3 + 4i = -2 + 2i

So DFT(1, 2, 3, 4) = (10, -2-2i, -2, -2+2i). Notice y₃ is the complex conjugate of y₁ — this always happens for real-valued inputs, and it means we only need to compute half the outputs in practice.

Conjugate symmetry for real inputs. If the input vector a is real-valued (no imaginary parts), then y_k = conjugate(y_n-k) for all k. This means the DFT output is fully determined by its first n/2 + 1 elements. Real-valued FFT libraries (like NumPy's rfft) exploit this for 2× speedup and 2× memory savings.

Interpreting the DFT Output

What do the DFT values mean? For the polynomial interpretation, y_k = A(ω^k) is simply the polynomial evaluated at the k-th root of unity. But for signal processing, the interpretation is richer:

Element	Meaning
y₀	DC component — the sum (or average) of the signal. No oscillation.
y₁	Amplitude and phase of the fundamental frequency — one cycle per signal period.
y_k	Amplitude and phase of the k-th harmonic — k cycles per signal period.
y_n/2	Nyquist frequency — the highest frequency representable at this sampling rate.

The magnitude |y_k| tells you how strong frequency k is. The angle arg(y_k) tells you the phase shift of that frequency component. Together, they give the complete frequency decomposition of the signal.

Concept check: What is y₀ (the first element of the DFT) for any input vector (a₀, a₁, ..., a_n-1)?

a₀ — just the first coefficient a₀ + a₁ + ... + a_n-1 — the sum of all coefficients (because ω⁰ = 1) 0 — the roots cancel out n · a₀ — scaled by the number of points

Chapter 4: The FFT Algorithm

This is the heart of the chapter. We want to compute the DFT in O(n log n) instead of O(n²). The trick is divide and conquer on the polynomial itself.

The key question: how do we split the polynomial? We cannot split by degree (first half / second half of coefficients) because that does not create two independent subproblems at the same evaluation points. Instead, we split by parity of index: even-indexed coefficients go to one subproblem, odd-indexed to the other.

Given A(x) = a₀ + a₁x + a₂x² + a₃x³ + ... + a_n-1x^n-1, separate the even-indexed and odd-indexed coefficients:

A_even(x) = a₀ + a₂x + a₄x² + ...
A_odd(x) = a₁ + a₃x + a₅x² + ...

Then A(x) = A_even(x²) + x · A_odd(x²). This is an identity — verify it by substituting back. For example, with A(x) = 1 + 2x + 3x² + 4x³:

A_even(y) = 1 + 3y       // even-indexed coeffs: a₀=1, a₂=3
A_odd(y) = 2 + 4y       // odd-indexed coeffs: a₁=2, a₃=4

A_even(x²) + x·A_odd(x²)
  = (1 + 3x²) + x(2 + 4x²)
  = 1 + 2x + 3x² + 4x³ = A(x) ✓

The key insight: both A_even and A_odd are polynomials of degree n/2 - 1, and they are both evaluated at x². We have split one degree-(n-1) evaluation problem into two degree-(n/2-1) evaluation problems.

Now here is where the roots of unity work their magic. We want to evaluate A at ω_n⁰, ω_n¹, ..., ω_n^n-1. When we plug these in to A(x) = A_even(x²) + x · A_odd(x²), we need to evaluate A_even and A_odd at the squares of the n-th roots. By the halving lemma, squaring the n-th roots gives the (n/2)-th roots, each appearing twice:

{(ω_n⁰)², (ω_n¹)², ..., (ω_n^n-1)²} = {ω_n/2⁰, ω_n/2¹, ..., ω_n/2^n/2-1, ω_n/2⁰, ω_n/2¹, ..., ω_n/2^n/2-1}

So evaluating A_even(x²) and A_odd(x²) at the n-th roots means evaluating A_even and A_odd at the (n/2)-th roots — a DFT of half the size! This is the divide step: one size-n DFT becomes two size-(n/2) DFTs, plus O(n) work to combine them.

The FFT recursion. To compute DFT_n(a):
1. Split a into even-indexed and odd-indexed coefficients.
2. Recursively compute DFT_n/2(even) and DFT_n/2(odd).
3. Combine using: y_k = even_k + ω_n^k · odd_k and y_k+n/2 = even_k - ω_n^k · odd_k.
This combine step is called the butterfly operation.

Why does the combine step have that plus/minus form? Let us derive it. For k in [0, n/2):

y_k = A(ω_n^k)
  = A_even((ω_n^k)²) + ω_n^k · A_odd((ω_n^k)²)
  = A_even(ω_n/2^k) + ω_n^k · A_odd(ω_n/2^k)
  = even_k + ω_n^k · odd_k

For k + n/2 (the second half of the output):

y_k+n/2 = A(ω_n^k+n/2)
  = A_even((ω_n^k+n/2)²) + ω_n^k+n/2 · A_odd((ω_n^k+n/2)²)
  = A_even(ω_n^2k+n) + ω_n^k+n/2 · A_odd(ω_n^2k+n)
  = A_even(ω_n^2k · ω_nⁿ) + (ω_n^k · ω_n^n/2) · A_odd(ω_n^2k · ω_nⁿ)
  (underlined terms: ω_nⁿ = 1, ω_n^n/2 = -1)
  = A_even(ω_n/2^k) + (-1) · ω_n^k · A_odd(ω_n/2^k)
  = even_k - ω_n^k · odd_k

The key step: ω_n^n/2 = e^πi = -1. The point halfway around the unit circle is exactly -1. So the second half of the output just negates the twiddle factor. One multiplication, two outputs. This is the butterfly.

Tracing the Recursion for n = 4

Let us trace FFT([1, 2, 3, 4]) step by step to see the machinery in action.

Level 0: FFT([1, 2, 3, 4])
  even = [1, 3], odd = [2, 4]

Level 1a: FFT([1, 3])
  even = [1], odd = [3]
  FFT([1]) = [1], FFT([3]) = [3]
  ω₂ = e^πi = -1
  y[0] = 1 + 1·3 = 4    y[1] = 1 - 1·3 = -2
  Result: [4, -2]

Level 1b: FFT([2, 4])
  even = [2], odd = [4]
  FFT([2]) = [2], FFT([4]) = [4]
  y[0] = 2 + 1·4 = 6    y[1] = 2 - 1·4 = -2
  Result: [6, -2]

Back to Level 0: combine [4,-2] with [6,-2]
  ω₄ = i
  k=0: t = ω⁰·6 = 6     y[0] = 4+6 = 10    y[2] = 4-6 = -2
  k=1: t = ω¹·(-2) = -2i y[1] = -2+(-2i) = -2-2i y[3] = -2-(-2i) = -2+2i

Final: [10, -2-2i, -2, -2+2i]

This matches our hand computation from Chapter 3. The recursive FFT computed the same answer with only 4 complex multiplications (at the combine steps) instead of the 16 required by the naive DFT matrix multiply.

Complexity Analysis

At each level of recursion, we do O(n) work (the n/2 butterfly operations). The recursion halves the problem each time, so there are log₂(n) levels. Total: T(n) = 2T(n/2) + O(n) = O(n log n).

More precisely, at each level we perform n/2 complex multiplications (the twiddle factors) and n complex additions/subtractions (the butterfly sums). Over log₂(n) levels, that is (n/2)·log₂(n) multiplications and n·log₂(n) additions. Since a complex multiplication takes 4 real multiplications and 2 real additions, the total real operation count is about 5n·log₂(n).

n	Naive O(n²)	FFT O(n log n)	Speedup
2⁴ = 16	256	64	4×
2⁸ = 256	65,536	2,048	32×
2¹⁰ = 1,024	1,048,576	10,240	102×
2¹⁶ = 65,536	4.3 × 10⁹	1,048,576	4,096×
2²⁰ = 1M	10¹²	20,971,520	50,000×

At n = 1 million, the FFT is fifty thousand times faster. This is why the FFT matters: it makes previously intractable problems routine.

The Code

python
import cmath

def fft(a):
    """Recursive FFT. Input length must be a power of 2."""
    n = len(a)
    if n == 1:
        return a  # base case: DFT of single element is itself

    # principal n-th root of unity
    omega_n = cmath.exp(2j * cmath.pi / n)
    omega = 1

    # split into even and odd
    a_even = a[0::2]  # indices 0, 2, 4, ...
    a_odd  = a[1::2]  # indices 1, 3, 5, ...

    # recurse
    y_even = fft(a_even)
    y_odd  = fft(a_odd)

    # combine (butterfly operations)
    y = [0] * n
    for k in range(n // 2):
        t = omega * y_odd[k]        # twiddle factor
        y[k]         = y_even[k] + t  # top wing
        y[k + n//2] = y_even[k] - t  # bottom wing
        omega *= omega_n
    return y

Each line maps directly to the math. The variable omega accumulates ω_n^k as k increments. The variable t is the twiddle factor — the term ω_n^k · odd_k that gets added to and subtracted from the even result. One complex multiplication produces two outputs. This is what makes the FFT efficient: each butterfly does 1 multiply for 2 outputs.

The Iterative (Bottom-Up) FFT

The recursive FFT is elegant but has overhead from function calls and array slicing. In practice, most implementations use an iterative version. The idea: process the butterfly stages bottom-up, from smallest to largest. The input must first be placed in bit-reversed order — index k goes to position reverse(k), where reverse flips the binary digits of k.

Why bit-reversal? The recursive FFT splits even/odd repeatedly. After log n splits, element at index k ends up at position bitReverse(k). The iterative version does this permutation upfront, then runs the butterfly stages in order.

python
import cmath

def fft_iterative(a, invert=False):
    """Iterative FFT using bit-reversal permutation."""
    n = len(a)
    a = a[:]  # copy
    log_n = n.bit_length() - 1

    # bit-reversal permutation
    for i in range(n):
        rev = int(bin(i)[2:].zfill(log_n)[::-1], 2)
        if i < rev:
            a[i], a[rev] = a[rev], a[i]

    # butterfly stages
    length = 2
    while length <= n:
        angle = 2 * cmath.pi / length * (-1 if invert else 1)
        omega_n = cmath.exp(1j * angle)

        for start in range(0, n, length):
            omega = 1
            for j in range(length // 2):
                u = a[start + j]
                t = omega * a[start + j + length // 2]
                a[start + j]                = u + t
                a[start + j + length // 2] = u - t
                omega *= omega_n
        length *= 2

    if invert:
        a = [x / n for x in a]
    return a

The iterative version runs in-place with O(1) extra memory (beyond the input array). This matters in practice: modern CPUs are cache-bound, and in-place operations are significantly faster than allocating new arrays at each recursion level.

Recursive vs Iterative: The Tradeoffs

Aspect	Recursive FFT	Iterative FFT
Code clarity	Very clean, maps directly to math	More complex, bit-reversal upfront
Memory	O(n log n) stack + new arrays	O(1) beyond input (in-place)
Cache behavior	Poor (scattered memory access)	Good (sequential access per stage)
Practical speed	Slower due to allocation overhead	Faster — used by all production libraries
When to use	Teaching, prototyping, interviews	Production code, competitive programming

For interviews, the recursive version is usually fine — it is cleaner and easier to get right under pressure. For production, always use a library (NumPy, FFTW, cuFFT) that uses the iterative approach with platform-specific optimizations.

Bit-reversal example for n=8. Index 0 = 000 → 000 = 0. Index 1 = 001 → 100 = 4. Index 2 = 010 → 010 = 2. Index 3 = 011 → 110 = 6. Index 4 = 100 → 001 = 1. Index 5 = 101 → 101 = 5. Index 6 = 110 → 011 = 3. Index 7 = 111 → 111 = 7. So the bit-reversed order is [0, 4, 2, 6, 1, 5, 3, 7].

Bit-Reversal Permutation

The iterative FFT starts by rearranging the input in bit-reversed order. Watch how index k (binary) maps to position reverse(k). This ensures the butterfly stages process elements in the correct pairs.

FFT Butterfly Diagram for n = 8

Watch the FFT process an 8-element input. Each stage halves the subproblem. Lines show how values combine: solid = add, dashed = subtract (with twiddle factor). Click Step to advance one stage.

Click Step to advance one stage

Concept check: In the FFT butterfly operation, we compute y_k = even_k + ω^k·odd_k and y_k+n/2 = even_k - ω^k·odd_k. Why do we get two outputs from one multiplication?

Because ω^k+n/2 = -ω^k, so the second output just negates the twiddle factor Because complex multiplication always produces two real numbers Because we pad the input to double its size Because the even and odd halves are the same

Chapter 5: Inverse FFT

We can now evaluate a polynomial at all n-th roots of unity in O(n log n) time — converting from coefficient form to point-value form. And we know that in point-value form, multiplication is just O(n) pointwise multiplication. But after multiplying, we have a product polynomial in point-value form. To be useful, we need to convert it back to coefficient form. We need the inverse DFT.

The beautiful surprise: the inverse DFT is almost the same as the forward DFT. This means we can reuse 100% of our FFT code for both directions.

Recall the DFT is y = F_n · a. The inverse is a = F_n^-1 · y. What is F_n^-1?

Here is the beautiful result: the inverse DFT matrix is almost the same as the forward DFT matrix. Specifically:

F_n^-1[j, k] = (1/n) · ω_n^-jk

The only differences from the forward DFT: (1) replace ω_n with ω_n^-1 (the conjugate root — go clockwise instead of counterclockwise), and (2) divide by n at the end.

Inverse FFT = Forward FFT with two tweaks. To compute the inverse DFT:
1. Run the exact same FFT algorithm, but use ω_n^-1 = e^-2πi/n instead of ω_n.
2. Divide every output element by n.
That is it. You reuse 100% of the FFT code. This is why the summation lemma matters — it is what makes F_n · F_n^-1 = I.

Why does this work? Let us prove it. We need to show that F_n · F_n^-1 = I, where F_n^-1[j,k] = (1/n)ω_n^-jk.

The (i, j)-th entry of F_n · F_n^-1 is:

∑_k=0^n-1 ω_n^ik · (1/n) · ω_n^-kj = (1/n) ∑_k=0^n-1 ω_n^k(i-j)

By the summation lemma, this sum equals n when i = j (each term is ω⁰ = 1, summed n times), and 0 when i ≠ j (the n-th roots of unity at power i-j cancel). So (1/n) · n = 1 on the diagonal and (1/n) · 0 = 0 elsewhere. That is the identity matrix. QED.

The deep symmetry. The DFT and its inverse differ only by the sign of the exponent and a scale factor of 1/n. This symmetry means that anything you can do in the time domain has an exact dual in the frequency domain. This duality is the foundation of all Fourier analysis.

Numerical Precision

One practical concern: the FFT uses floating-point arithmetic, and complex exponentials are irrational numbers. After a forward FFT and inverse FFT roundtrip, the recovered coefficients are approximately equal to the original — not exact. For integer coefficients, the error is typically less than 10^-10, so rounding to the nearest integer gives exact results. But for very large n (millions), errors can accumulate.

This is one motivation for the Number Theoretic Transform (NTT), which works over finite fields and gives exact results. We will discuss the NTT in Chapter 9.

The Complete Pipeline

We now have all three pieces for O(n log n) polynomial multiplication:

# Step 1: Pad to length 2n (power of 2), then forward FFT
ya = FFT(a) # O(n log n)
yb = FFT(b) # O(n log n)

# Step 2: Pointwise multiply
yc = [ya[k] * yb[k] for k in range(2n)] # O(n)

# Step 3: Inverse FFT to get coefficients
c = IFFT(yc) # O(n log n)

python
import cmath

def ifft(y):
    """Inverse FFT. Same as FFT but with conjugate roots, then divide by n."""
    n = len(y)
    if n == 1:
        return y

    omega_n = cmath.exp(-2j * cmath.pi / n)  # NEGATIVE exponent
    omega = 1

    y_even = y[0::2]
    y_odd  = y[1::2]

    a_even = ifft(y_even)
    a_odd  = ifft(y_odd)

    a = [0] * n
    for k in range(n // 2):
        t = omega * a_odd[k]
        a[k]         = a_even[k] + t
        a[k + n//2] = a_even[k] - t
        omega *= omega_n
    return a

def inverse_fft(y):
    """Full inverse: run ifft then divide by n."""
    n = len(y)
    a = ifft(y)
    return [x / n for x in a]

Forward FFT → Inverse FFT Roundtrip

Enter coefficients on the left. Forward FFT produces frequency-domain values (middle). Inverse FFT recovers the original coefficients (right). Verify they match!

a₀ 1

a₁ 2

a₂ 3

a₃ 4

Concept check: What are the two differences between the forward FFT and the inverse FFT?

Use conjugate root (ω^-1 instead of ω) and divide all outputs by n Reverse the input array and multiply by n Swap even and odd indices, then add n to each element There is no difference — the same algorithm is used for both

Chapter 6: Convolution

This chapter is the payoff. Everything we have built — polynomial representations, roots of unity, the DFT, the FFT, the inverse FFT — comes together in one of the most powerful ideas in applied mathematics: convolution.

Polynomial multiplication has a name in signal processing: convolution. When you multiply A(x) · B(x) = C(x), the coefficient c_k of the product is:

c_k = ∑_j=0^k a_j · b_k-j

This is the convolution sum, often written c = a * b (the asterisk means convolution, not multiplication). It appears everywhere in science and engineering: when you blur an image (convolve the image with a kernel), when you filter audio (convolve the signal with a filter), when you compute the probability distribution of the sum of two random variables, when you cross-correlate two time series, when you apply a moving average. In every case, the naive computation is O(n²) and the FFT makes it O(n log n).

The connection to polynomial multiplication is immediate: the k-th coefficient of C = A · B is the convolution of the coefficient vectors of A and B. So every technique for fast polynomial multiplication automatically gives fast convolution.

The Convolution Theorem. Convolution in the time/spatial domain equals pointwise multiplication in the frequency domain. If c = a * b (convolution), then FFT(c) = FFT(a) · FFT(b) (pointwise). This is the single most important result in signal processing.

Think of it this way. You have a signal (a sequence of numbers representing audio samples). You have a filter (another sequence — maybe a low-pass filter that removes high-frequency noise). To apply the filter, you convolve the signal with the filter. The naive approach slides the filter across the signal, computing a dot product at each position — O(n²). The FFT approach: transform both to frequency domain, multiply pointwise, transform back — O(n log n).

Why Does the Convolution Theorem Work?

This is not magic — it follows directly from what we already know. The coefficient vector of the product polynomial A · B is the convolution of the coefficient vectors of A and B. And we showed that polynomial multiplication can be done by: FFT, pointwise multiply, IFFT. Combining these two facts:

Convolution of a and b

= coefficient vector of A(x) · B(x)

↓

= IFFT( FFT(a) ⊙ FFT(b) )

where ⊙ is pointwise multiplication

That is the convolution theorem: convolution in the time domain equals pointwise multiplication in the frequency domain. The FFT is simply the efficient way to switch between domains.

python
import numpy as np

def convolve_fft(signal, kernel):
    """Convolve signal with kernel using FFT."""
    n = len(signal) + len(kernel) - 1
    N = 1
    while N < n:
        N <<= 1

    # FFT both, pointwise multiply, IFFT
    S = np.fft.fft(signal, N)
    K = np.fft.fft(kernel, N)
    return np.fft.ifft(S * K).real[:n]

# Verify against numpy's direct convolution
s = [1, 2, 3, 4, 5]
k = [0.2, 0.6, 0.2]  # smoothing kernel
print(np.allclose(convolve_fft(s, k), np.convolve(s, k)))  # True

Step-by-Step Polynomial Multiplication

Let us multiply A(x) = 1 + 2x + 3x² and B(x) = 2 + 1x using the full FFT pipeline.

Step 1: Pad to length 4 (next power of 2 ≥ deg(A) + deg(B) + 1 = 4)
a = [1, 2, 3, 0] ← padded with one zero
b = [2, 1, 0, 0] ← padded with two zeros

Step 2: Forward FFT (evaluate at 4th roots: 1, i, -1, -i)
A(1) = 1+2+3+0 = 6 B(1) = 2+1+0+0 = 3
A(i) = 1+2i-3+0 = -2+2i B(i) = 2+i+0+0 = 2+i
A(-1) = 1-2+3+0 = 2 B(-1) = 2-1+0+0 = 1
A(-i) = 1-2i-3+0 = -2-2i B(-i) = 2-i+0+0 = 2-i

Step 3: Pointwise multiply
C(1) = 6·3 = 18
C(i) = (-2+2i)(2+i) = -4-2i+4i+2i² = -6+2i
C(-1) = 2·1 = 2
C(-i) = (-2-2i)(2-i) = -4+2i-4i+2i² = -6-2i

Step 4: Inverse FFT
c = IFFT([18, -6+2i, 2, -6-2i]) = [2, 5, 8, 3]

Result: C(x) = 2 + 5x + 8x² + 3x³

Verify: (1 + 2x + 3x²)(2 + x) = 2 + x + 4x + 2x² + 6x² + 3x³ = 2 + 5x + 8x² + 3x³. Correct.

Convolution in Probability

Here is a beautiful application that makes convolution tangible. Roll two fair dice. What is the probability of each possible sum (2 through 12)?

Each die has probabilities [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] for outcomes [1, 2, 3, 4, 5, 6]. The distribution of the sum of two independent random variables is the convolution of their individual distributions.

P(sum = k) = ∑_j P(die 1 = j) · P(die 2 = k-j)

P(sum = 2) = (1/6)(1/6) = 1/36
P(sum = 3) = (1/6)(1/6) + (1/6)(1/6) = 2/36
P(sum = 4) = (1/6)(1/6) + (1/6)(1/6) + (1/6)(1/6) = 3/36
⋮
P(sum = 7) = 6/36 (most likely sum)
⋮
P(sum = 12) = 1/36

The convolution of two uniform distributions gives a triangular distribution. If you had 100 dice (combinatorics exam question!), the naive approach requires summing over all combinations — but the FFT computes the convolution of 100 distributions in O(n · 100 · log(n·100)) time by convolving them pairwise.

The pattern. Whenever you see "distribution of a sum" — sum of dice, total delay through a network, combined noise from multiple sources — that is convolution. And convolution means FFT can speed it up.

FFT Polynomial Multiplication — Full Pipeline

Watch the three-step process: forward FFT on both polynomials, pointwise multiply, inverse FFT. Adjust coefficients of A and B. The result C = A · B is shown below.

A: degree 2

B: degree 2

Signal Convolution

Now the real payoff. Instead of abstract polynomials, think of the coefficients as signal samples. A signal a = [a₀, a₁, ..., a_n-1] represents sound pressure at each time step. A kernel b = [b₀, b₁, ..., b_m-1] represents a filter (like an averaging window for smoothing, or a difference operator for edge detection).

Convolution slides the kernel across the signal, computing a weighted sum at each position. The output c_k = ∑ a_j · b_k-j is the filtered signal.

What Each Kernel Does

A smoothing kernel (box filter) is just [1/w, 1/w, ..., 1/w] — it averages w adjacent samples. This removes high-frequency noise but blurs sharp edges. A Gaussian kernel weights the center more heavily than the edges, giving a smoother blur.

An edge detection kernel like [-1, 0, 1] computes the discrete derivative — the difference between adjacent samples. Wherever the signal changes sharply, the output spikes. This is how Photoshop's "find edges" filter works.

A sharpening kernel emphasizes differences: it subtracts a blurred version of the signal from the original, amplifying edges. The kernel [0, -1, 5, -1, 0] (center-heavy with negative sides) achieves this.

An echo kernel like [0.7, 0, 0, ..., 0, 0.3] adds a delayed copy of the signal to itself. In audio, this literally creates an echo — you hear the original sound followed by a quieter repetition. Professional reverb effects use kernels with thousands of samples (the impulse response of a physical space, recorded by playing a loud click in a concert hall and recording the result). Convolving dry audio with a hall impulse response gives you audio that sounds like it was recorded in that hall. The FFT makes this real-time.

Why FFT convolution beats direct convolution. Direct convolution of a signal of length n with a kernel of length k takes O(n·k) operations. For small k (like a 3-tap filter), this is fine — O(n). But for large kernels (reverb impulse responses can be 100,000+ samples), direct convolution is slow. FFT convolution is always O(n log n) regardless of kernel size, because it transforms both, multiplies pointwise, and transforms back.

Signal Convolution via FFT

Top: input signal (orange). Middle: convolution kernel (teal). Bottom: convolved output (purple). Choose different kernel types and watch how the signal changes. All computed via FFT!

Kernel width 5

Concept check: The convolution theorem states that convolution in the time domain equals _____ in the frequency domain.

Addition Pointwise multiplication Convolution again (it is the same in both domains) Division

Chapter 7: Applications

The FFT is not merely a theoretical curiosity — it is one of the most practically consequential algorithms in computing. Every application below would be intractable without it.

1. Signal Processing and Spectral Analysis

A microphone records sound as a sequence of pressure values over time — the time domain representation. But sound is really a sum of sinusoidal waves at different frequencies. The FFT converts from time domain to frequency domain, revealing which frequencies are present and how strong each one is. This is called spectral analysis.

Your phone does this in real time when it identifies a song (Shazam), processes your voice (speech recognition), or noise-cancels your AirPods. The FFT runs millions of times per second in these applications.

A concrete example: CD-quality audio is sampled at 44,100 Hz (44,100 samples per second). To analyze the frequency content of a 1-second clip, you compute the FFT of 44,100 samples. The output gives you 22,050 frequency bins (by conjugate symmetry), each representing a frequency from 0 Hz to 22,050 Hz. The magnitude of each bin tells you how loud that frequency is. A piano note at middle C (262 Hz) will show a peak at bin 262 and its harmonics (524 Hz, 786 Hz, etc.).

For real-time analysis, you use the Short-Time Fourier Transform (STFT): take overlapping windows of ~1,024 samples, FFT each window, and stack the results. The output is a spectrogram — a 2D plot with time on one axis, frequency on the other, and color representing magnitude. Spectrograms are the standard visualization for audio in music production, speech analysis, and machine learning.

How noise cancellation works. Record ambient noise. FFT it to get the frequency spectrum. Generate a signal with the same frequencies but inverted phase. Play it through the speaker. The noise and anti-noise cancel out. All of this happens in under 10 milliseconds, and the bottleneck is the FFT.

2. Image Processing

A 2D image is a grid of pixel values. Blurring, sharpening, and edge detection are all 2D convolutions — sliding a small kernel (like a 3×3 Gaussian) across the image. For a 4K image (8 million pixels) and a large kernel, naive convolution is slow. The 2D FFT (applying the 1D FFT to rows, then columns) makes it O(N log N) where N is the total number of pixels.

Every time you apply a filter in Photoshop or Instagram, the 2D FFT is running under the hood. MRI scanners also use the 2D (and 3D) FFT: the scanner collects data in "k-space" (the frequency domain), and the inverse FFT converts it to the spatial image your doctor sees. A single MRI scan requires millions of FFT operations.

3. Big Integer Multiplication

Multiplying two n-digit integers is exactly polynomial multiplication — treat each digit as a coefficient. The number 1234 = 4 + 3·10 + 2·100 + 1·1000 = polynomial A(x) = 4 + 3x + 2x² + x³ evaluated at x = 10.

So to multiply two n-digit numbers: treat them as polynomials, multiply using FFT, then propagate carries (because digit values can exceed 9 after multiplication). Grade-school multiplication is O(n²). The FFT-based Schönhage-Strassen algorithm (1971) multiplies n-digit integers in O(n log n log log n). This is what Python uses internally for very large integers (thousands of digits). In 2019, Harvey and van der Hoeven achieved the theoretically optimal O(n log n), matching the FFT bound exactly.

4. String Matching via Convolution

Surprise: you can find all occurrences of a pattern in a text using convolution. Encode the text and pattern as polynomials (character values as coefficients), convolve, and read off matches from the output. The trick: reverse the pattern before convolving, so that the convolution computes the dot product of the pattern with each window of the text.

This is especially useful for wildcard matching where the pattern contains "don't care" characters — a case where KMP and Boyer-Moore struggle. With wildcards, define a match score polynomial: for each text position, the convolution output gives the number of matching characters. Positions where the score equals the pattern length (minus wildcards) are matches. The FFT computes all scores simultaneously in O(n log n).

5. Audio Compression (MP3, AAC)

MP3 compression works by transforming audio to the frequency domain using a cousin of the FFT called the Modified Discrete Cosine Transform (MDCT). In the frequency domain, the encoder identifies which frequencies are inaudible to human ears (psychoacoustic model) and removes them. The result: 10× compression with minimal perceived quality loss.

6. Solving PDEs and Physics Simulations

Many differential equations (heat equation, wave equation, Schrödinger equation) become simple algebraic equations in the frequency domain. Transform, solve, transform back. This is called the spectral method and it powers weather prediction models, fluid dynamics simulations, and quantum chemistry calculations.

7. Machine Learning

In 2021, Google's FNet paper showed that replacing the attention mechanism in Transformers with a simple FFT achieves 92% of BERT's accuracy while being 7× faster on GPUs. The intuition: attention mixes information across token positions, and the FFT does the same thing (it mixes all input positions via the butterfly network). The FFT's fixed mixing pattern is less flexible than learned attention weights, but its O(n log n) complexity beats attention's O(n²).

The FFT also appears in spectral graph neural networks, where convolution on graphs is defined via the graph Fourier transform (eigendecomposition of the graph Laplacian).

8. Cryptography (Post-Quantum)

Post-quantum cryptography relies heavily on polynomial arithmetic over finite fields. The NIST-standardized algorithms CRYSTALS-Kyber (key encapsulation) and CRYSTALS-Dilithium (digital signatures) both use the Number Theoretic Transform (NTT) — the FFT over Z/qZ — as their core computational primitive. Every time you send an encrypted message using post-quantum crypto, the NTT is running.

The security of these systems relies on the hardness of the Ring Learning With Errors (RLWE) problem, which involves polynomial arithmetic in Z_q[x]/(xⁿ+1). Adding and multiplying polynomials in this ring requires modular reduction, and the NTT makes the multiplication step efficient. Without the FFT/NTT, post-quantum cryptography would be too slow for practical use.

Application	What's Convolved	Naive Cost	FFT Cost
Audio filtering	Signal × filter impulse response	O(n²)	O(n log n)
Image blur	Image × Gaussian kernel	O(N·k²)	O(N log N)
Big integer ×	Digit vectors	O(n²)	O(n log n)
Polynomial ×	Coefficient vectors	O(n²)	O(n log n)
String matching	Text × reversed pattern	O(nm)	O(n log n)
MRI reconstruction	k-space data → image	O(N²)	O(N log N)

Frequency Spectrum Analyzer

Compose a signal from up to 4 sine waves of different frequencies. The FFT decomposes it back into its constituent frequencies. Adjust frequency and amplitude sliders to see the spectrum change.

Freq 1 3

Amp 1 1.0

Freq 2 7

Amp 2 0.5

Practical Implementation Notes

A few things that textbooks often omit but matter in real code:

Zero-padding. The FFT requires the input length to be a power of 2 (for the standard Cooley-Tukey algorithm). If your data has length 1000, pad it to 1024 with zeros. This does not change the polynomial — it just adds higher-order zero coefficients. For convolution, you must pad to at least len(a) + len(b) - 1 (rounded up to the next power of 2) to avoid circular convolution artifacts.

Circular vs linear convolution. The DFT computes circular convolution: the signal wraps around. If you want standard (linear) convolution, you must zero-pad both signals to length ≥ len(a) + len(b) - 1 before applying the FFT. Forgetting this is the #1 FFT implementation bug.

In-place computation. Production FFT libraries (FFTW, MKL, cuFFT) work in-place and use the iterative bit-reversal approach. They also use mixed-radix decompositions (not just radix-2), SIMD vectorization, and cache-optimal memory access patterns. A well-tuned FFT library can be 10-100× faster than a naive recursive implementation.

When NOT to Use FFT Convolution

The FFT has overhead: you need to (1) zero-pad both arrays, (2) compute two forward FFTs, (3) compute pointwise products, and (4) compute one inverse FFT. For small inputs, this overhead is larger than the naive O(nk) cost.

The crossover point depends on the implementation and hardware, but as a rule of thumb: if the kernel has fewer than about 64 elements, direct convolution is faster. If the kernel has more than about 64 elements, FFT convolution wins. For very large kernels (thousands of elements), the FFT is dramatically faster.

In machine learning, most convolution layers use small kernels (3×3 or 5×5), so they use direct convolution (or the im2col + GEMM approach). But in audio processing, reverb kernels can have 100,000+ samples, and the FFT is essential.

Libraries like SciPy automatically choose between direct and FFT convolution based on input size — the function scipy.signal.fftconvolve always uses FFT, while scipy.signal.convolve picks the faster method.

Concept check: MP3 compression works by transforming audio to the frequency domain and removing inaudible frequencies. Which FFT-related algorithm does it use?

Standard DFT Inverse FFT only Modified Discrete Cosine Transform (MDCT), a cousin of the FFT Wavelet transform

Chapter 8: Interview Arsenal

Core Facts (Cheat Sheet)

Concept	Key Fact
DFT	Evaluate polynomial at n-th roots of unity: y_k = ∑ a_jω^jk
FFT	Compute DFT in O(n log n) via divide-and-conquer on even/odd coefficients
Inverse FFT	Same algorithm with ω^-1 instead of ω, divide by n
Butterfly	y_k = e_k + ω^ko_k, y_k+n/2 = e_k - ω^ko_k
Convolution theorem	Time-domain convolution = frequency-domain pointwise multiply
Poly multiplication	FFT both → pointwise multiply → IFFT. O(n log n) total
Input requirement	Length must be power of 2 (pad with zeros)
Roots of unity	ω_n^k = e^2πik/n. Halving: (ω_n^k)² = ω_n/2^k

Coding Drill: Full Polynomial Multiplication

python
import cmath

def fft(a, invert=False):
    """Unified FFT/IFFT. Set invert=True for inverse."""
    n = len(a)
    if n == 1:
        return a[:]

    # direction: +1 for forward, -1 for inverse
    angle = 2 * cmath.pi / n * (-1 if invert else 1)
    omega_n = cmath.exp(1j * angle)
    omega = 1

    a_even = fft(a[0::2], invert)
    a_odd  = fft(a[1::2], invert)

    result = [0] * n
    for k in range(n // 2):
        t = omega * a_odd[k]
        result[k]         = a_even[k] + t
        result[k + n//2] = a_even[k] - t
        omega *= omega_n

    if invert:
        result = [x / 2 for x in result]  # divide by 2 at each level = /n total
    return result

def poly_multiply(a, b):
    """Multiply polynomials a and b using FFT. Returns coefficient list."""
    # find result size and pad to power of 2
    result_len = len(a) + len(b) - 1
    n = 1
    while n < result_len:
        n <<= 1

    # pad with zeros
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))

    # forward FFT
    fa = fft(a)
    fb = fft(b)

    # pointwise multiply
    fc = [fa[i] * fb[i] for i in range(n)]

    # inverse FFT
    c = fft(fc, invert=True)

    # round to integers if coefficients were integers
    return [round(x.real) for x in c[:result_len]]

# Test: (1 + 2x + 3x²) * (2 + x) = 2 + 5x + 8x² + 3x³
print(poly_multiply([1, 2, 3], [2, 1]))  # [2, 5, 8, 3]

When to Recognize FFT/Convolution in Interviews

The FFT rarely appears by name in coding interviews. Instead, the interviewer asks a problem that is secretly convolution:

Problem Pattern	Why It's Convolution
"Multiply two big numbers represented as arrays"	Digits are polynomial coefficients. Multiply polynomials, carry.
"Count pairs with difference k" or "histogram of sums"	Frequency count is a polynomial. Sum of two dice = convolution of two distributions.
"Find pattern with wildcards in text"	Encode as polynomials. Convolution gives match scores at each position.
"Multiply two polynomials mod x^n"	Direct FFT application. Common in competitive programming.
"Apply a 1D filter to a signal"	Filter = convolution kernel. FFT for large kernels.

Interview heuristic. If the naive solution involves a nested loop where the inner loop's index depends on the outer loop's index via subtraction (like c_k = ∑ a_j · b_k-j), it is probably convolution and the FFT can speed it up.

Coding Drill: Big Integer Multiplication

Treat digits as polynomial coefficients. Multiply using FFT. Then propagate carries.

python
def big_multiply(num1_str, num2_str):
    """Multiply two large integers represented as strings."""
    a = [int(d) for d in reversed(num1_str)]  # least significant first
    b = [int(d) for d in reversed(num2_str)]

    # polynomial multiplication via FFT
    c = poly_multiply(a, b)

    # propagate carries
    carry = 0
    result = []
    for val in c:
        total = val + carry
        result.append(total % 10)
        carry = total // 10
    while carry:
        result.append(carry % 10)
        carry //= 10

    return ''.join(str(d) for d in reversed(result)).lstrip('0') or '0'

# Test: 123 * 456 = 56088
print(big_multiply("123", "456"))  # "56088"

Coding Drill: Using NumPy's FFT

In practice, you will use a library. NumPy's FFT is highly optimized (FFTPACK under the hood).

python
import numpy as np

def poly_multiply_numpy(a, b):
    """Polynomial multiplication using NumPy's FFT."""
    result_len = len(a) + len(b) - 1
    n = 1
    while n < result_len:
        n <<= 1

    fa = np.fft.fft(a, n)
    fb = np.fft.fft(b, n)
    fc = fa * fb
    c = np.fft.ifft(fc).real

    return np.round(c[:result_len]).astype(int)

# 5 lines. That is it. The FFT does the heavy lifting.
print(poly_multiply_numpy([1,2,3], [2,1]))  # [2, 5, 8, 3]

Common Interview Mistakes

Mistake	Why It Fails	Fix
Forgetting to pad to power of 2	FFT requires n = 2^k	Pad with zeros: while n < len: n *= 2
Not padding enough for convolution	Get circular convolution artifacts	Pad to len(a) + len(b) - 1, rounded up
Using forward FFT for inverse	Wrong roots, no 1/n scaling	Negate exponent, divide by n
Off-by-one in output length	Product of deg-n and deg-m has deg n+m	Result has n+m+1 coefficients
Not rounding after IFFT	Floating-point noise gives 2.999999	Round to int if coefficients are integers

Complexity Comparison

O(n²) vs O(n log n): The Speedup

As n grows, the FFT's advantage becomes staggering. Drag the slider to see the operation count for naive vs FFT polynomial multiplication.

n (log scale) 1024

The Five Interview Dimensions

For FFT-related interview questions, make sure you can address these five aspects:

Dimension	What to Demonstrate
CONCEPT	Explain the three-step strategy (evaluate, multiply, interpolate). Why roots of unity? Why O(n log n)?
DESIGN	When to use FFT vs naive convolution (crossover around n ≈ 64). Zero-padding for linear convolution. Power-of-2 sizing.
CODE	Implement FFT from scratch (recursive or iterative). Use library FFT for polynomial multiply. NTT for modular arithmetic.
DEBUG	Common bugs: circular vs linear convolution, off-by-one in output length, forgetting to divide by n in IFFT.
FRONTIER	NTT for post-quantum crypto. FNet (FFT replacing attention). Sparse FFT for near-sparse signals. Sub-linear FFT.

Concept check: You need to multiply two degree-1000 polynomials. How many operations does the FFT approach take (approximately)?

About 1,000,000 (n²) About 20,000 (n log n, where n ≈ 2048 after padding) About 1,000 (linear in degree) About 1,000,000,000 (n³)

Chapter 9: Connections

The FFT connects to an extraordinary range of topics in computer science and mathematics. Here is how it fits into the CLRS universe and beyond.

Within CLRS

Chapter	Connection
Ch 4: Divide & Conquer	The FFT is a D&C algorithm: T(n) = 2T(n/2) + O(n) = O(n log n). The master theorem (Case 2) gives the complexity directly.
Ch 31: Number Theory	The FFT works over any ring with principal roots of unity — including Z/pZ for prime p. This gives the Number Theoretic Transform (NTT), which avoids floating-point errors entirely.
Ch 32: String Matching	Wildcard pattern matching via convolution: encode pattern and text as polynomials, convolve, and read off match positions from the output.

Beyond CLRS

Topic	How FFT Appears
Signal Processing	The foundation of DSP. STFT, spectrogram, filter design, noise cancellation.
Machine Learning	Efficient attention mechanisms (FNet replaces attention with FFT). Spectral graph convolutions.
Cryptography	NTT for polynomial operations in lattice-based post-quantum crypto (CRYSTALS-Kyber, Dilithium).
Competitive Programming	Any problem reducible to polynomial multiplication or convolution. ~15% of advanced CP problems use FFT.

The Number Theoretic Transform (NTT)

The FFT uses complex roots of unity, which require floating-point arithmetic. For exact computation (big integers, competitive programming), the NTT works in modular arithmetic. Choose a prime p such that p - 1 is divisible by a large power of 2 (popular choice: p = 998244353, where p - 1 = 2²³ · 119). The primitive root modulo p plays the role of ω. The algorithm is identical — just replace complex exponentiation with modular exponentiation.

Why does this work? Because a finite field Z/pZ (integers mod prime p) also has roots of unity. If g is a primitive root mod p (meaning g generates all nonzero elements), then ω = g^(p-1)/n is a principal n-th root of unity mod p. It satisfies all three properties: cancellation, halving, and summation — all in modular arithmetic.

python
# Number Theoretic Transform (NTT)
# Exact polynomial multiplication over integers mod p

MOD = 998244353   # NTT-friendly prime: 2^23 * 119 + 1
PRIM_ROOT = 3    # primitive root of MOD

def power_mod(base, exp, mod):
    """Fast modular exponentiation."""
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:
            result = result * base % mod
        exp >>= 1
        base = base * base % mod
    return result

def ntt(a, invert=False):
    """In-place NTT. Works exactly like iterative FFT but mod p."""
    n = len(a)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit; bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    length = 2
    while length <= n:
        w = power_mod(PRIM_ROOT, (MOD - 1) // length, MOD)
        if invert:
            w = power_mod(w, MOD - 2, MOD)  # modular inverse
        for start in range(0, n, length):
            wn = 1
            for k in range(length // 2):
                u = a[start + k]
                v = a[start + k + length // 2] * wn % MOD
                a[start + k] = (u + v) % MOD
                a[start + k + length // 2] = (u - v) % MOD
                wn = wn * w % MOD
        length <<= 1

    if invert:
        inv_n = power_mod(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * inv_n % MOD

def poly_multiply_ntt(a, b):
    """Exact polynomial multiplication using NTT."""
    result_len = len(a) + len(b) - 1
    n = 1
    while n < result_len: n <<= 1
    fa = a + [0] * (n - len(a))
    fb = b + [0] * (n - len(b))
    ntt(fa); ntt(fb)
    fc = [fa[i] * fb[i] % MOD for i in range(n)]
    ntt(fc, invert=True)
    return fc[:result_len]

# Test: (1 + 2x + 3x²)(2 + x) = 2 + 5x + 8x² + 3x³
print(poly_multiply_ntt([1,2,3], [2,1]))  # [2, 5, 8, 3]

This NTT implementation is the standard template for competitive programming. It handles polynomials with coefficients up to ~10⁹ and lengths up to 2²³ ≈ 8 million. The mod 998244353 is chosen specifically because its totient has a large power-of-2 factor, enabling the full radix-2 FFT decomposition.

NTT-friendly primes. Not every prime works for NTT. You need p such that p - 1 has a large power-of-2 factor (to support arrays up to that power of 2). Common choices in competitive programming:

998244353 = 2²³ · 119 + 1 (primitive root: 3, max n = 2²³)
985661441 = 2²³ · 117 + 1 (primitive root: 3, max n = 2²³)
469762049 = 2²⁶ · 7 + 1 (primitive root: 3, max n = 2²⁶)

If you need results mod a different prime p', use NTT with multiple NTT-friendly primes and combine using the Chinese Remainder Theorem (CRT).

When to Use NTT vs Complex FFT

Situation	Use	Why
Coefficients are real/complex floats	Complex FFT	Natural domain, simpler code
Exact integer results needed	NTT or complex FFT + rounding	Complex FFT works for small ints (<10¹⁵)
Result needed mod a prime	NTT	Exact, no floating-point
Competitive programming	NTT (mod 998244353)	Exact, fast, standard template
Signal/image processing	Complex FFT	Data is naturally floating-point
Post-quantum cryptography	NTT	Security requires exact modular arithmetic

Variants and Extensions

Variant	What Changes	When to Use
Cooley-Tukey FFT	Standard radix-2 D&C (this lesson)	General purpose, n = power of 2
Mixed-Radix FFT	Splits into radix-3, radix-5, etc.	n not a power of 2 (e.g., n = 12 = 4×3)
Bluestein's FFT	Reduces arbitrary-n DFT to convolution	n is prime (no small factors)
NTT	Roots of unity in Z/pZ, not C	Exact arithmetic, competitive programming, crypto
2D FFT	Apply 1D FFT to rows, then columns	Image processing, 2D convolution
STFT	FFT on overlapping windows	Time-frequency analysis (spectrograms)
DCT	Real-valued cousin, cosine basis	JPEG, MP3, video compression

Historical Context

The Cooley-Tukey paper (1965) is often cited as the discovery of the FFT. But Gauss described essentially the same algorithm in 1805 — 160 years earlier — while computing the orbits of asteroids. He never published it as a standalone result. The lesson: important algorithms are often discovered, forgotten, and rediscovered.

The impact of Cooley-Tukey was immediate and enormous. Before 1965, computing a DFT of size 1024 required ~1 million operations. After, it required ~10,000. Entire fields — digital signal processing, computational physics, medical imaging — became practical overnight. The algorithm has been called "the most important numerical algorithm of our lifetime" by Gilbert Strang and "one of the great computational tools of our century" by the American Mathematical Society.

What We Built

Let us step back and admire the full construction. We started with a simple problem — multiply two polynomials — and arrived at one of the deepest algorithms in computer science.

Representation duality

Coefficient form (multiply is hard, add is easy) vs point-value form (multiply is easy, evaluate is hard).

↓

Roots of unity

Special evaluation points with recursive structure (halving lemma) and orthogonality (summation lemma).

↓

Divide & conquer

Split even/odd coefficients. Evaluation points square to half-size set. T(n) = 2T(n/2) + O(n).

↓

Butterfly combines

ω^k+n/2 = -ω^k gives two outputs per multiplication. O(n log n) total.

↓

Inverse = same algorithm

Conjugate roots + divide by n. Reuses all FFT code.

↓

Convolution theorem

Unlocks signal processing, image processing, big integer multiplication, and hundreds of other applications.

Every step builds on the previous one. Remove any link in the chain and the algorithm does not work. This is what makes the FFT one of the most elegant constructions in all of algorithm design.

The FFT is not just an algorithm — it is a way of thinking. The idea that changing representation can make hard problems easy, that structure in evaluation points enables divide and conquer, that the same algorithm works forward and backward — these ideas appear again and again across mathematics, physics, and computer science. Mastering the FFT gives you a template for recognizing and exploiting structure in computation.

FFT Pipeline: The Complete Picture

The full polynomial multiplication pipeline visualized. Input coefficients → FFT → pointwise multiply → IFFT → output coefficients. Each box shows the data at that stage.

Closing thought. "The FFT is the most important numerical algorithm of our lifetime." — Gilbert Strang, MIT. It converts O(n²) into O(n log n) for one of the most universal operations in computing. Every time you make a phone call, stream music, take a photo, or ask a voice assistant a question, the FFT is running — millions of times per second, invisibly, making the modern world possible.

Concept check: The Number Theoretic Transform (NTT) is a variant of the FFT that works in modular arithmetic. What advantage does it have over the standard complex-valued FFT?

It is asymptotically faster (O(n) instead of O(n log n)) It avoids floating-point errors, giving exact results It works on non-power-of-2 input sizes It does not require roots of unity