The Polar Express

Chapter 0: The Problem

You are training a large language model. At every step, you compute a gradient matrix G — this tells you which direction to nudge the weights to reduce loss. Standard SGD just subtracts this gradient (scaled by a learning rate) from the weights. Adam and AdamW are smarter: they rescale each parameter individually by dividing by a running estimate of the gradient's magnitude.

But there is a newer optimizer that has been breaking records. It is called Muon (MomentUm Orthogonalized by Newton-Schulz). Muon does something strange: instead of stepping in the direction of the gradient, it steps in the direction of the gradient's polar factor — the closest semi-orthogonal matrix to the gradient.

Why? Because the polar factor gives the steepest descent direction with respect to the spectral norm rather than the Frobenius norm. This single change has let Muon outperform Adam on the NanoGPT speedrun, train 1 trillion parameter models, and push the Pareto frontier of performance-vs-FLOPs for large language models.

The bottleneck: Computing the polar factor exactly requires a full Singular Value Decomposition (SVD), which costs O(mn·min(m,n)) and is catastrophically slow on GPUs. SVD is inherently sequential — it cannot be efficiently parallelized into the massive matrix-multiply operations that GPUs are built for. In practice, using the exact SVD doubles the wall-clock time of every training step.

So Muon uses an approximation. The standard one is Newton-Schulz iteration: apply a fixed polynomial to the matrix repeatedly, and the singular values converge toward 1 (which is what the polar factor does — it replaces every singular value with 1). Newton-Schulz uses only matrix-matrix multiplications, so it flies on GPUs.

But Newton-Schulz has a crippling weakness: it is glacially slow at the start. If some singular values are much smaller than others (which they are in real gradient matrices), Newton-Schulz makes almost no progress for the first 15+ iterations. And in deep learning, you can only afford 5 or 6 iterations per training step.

Heuristic alternatives exist. Keller Jordan proposed a hand-tuned polynomial that converges fast initially but never reaches high accuracy — it plateaus at an error of roughly 0.3 and stays there forever. Cesista, You, and Jordan proposed a 6-step sequence of different polynomials found by heuristic search, which is better but still fails to converge.

The question this paper answers: Is there a provably optimal choice of polynomials — one that converges as fast as possible at every single iteration, both early (when accuracy matters most for training) and late (when you need convergence guarantees)?

The simulation below shows the core problem. Each method applies a degree-5 polynomial at every iteration. Watch how Newton-Schulz flatlines for the first 15 iterations, while Jordan's method races ahead but then gets stuck. Can we do better than both?

The Convergence Problem

Spectral-norm error of three methods approximating sign(x) on a matrix with singular values between 10^-6 and 1. All use degree-5 polynomials. Click "Run" to watch them iterate.

What is the core practical problem with using Newton-Schulz in Muon?

It requires matrix inversions, which are slow on GPUs It converges very slowly in the first few iterations when singular values span a wide range, and deep learning can only afford 5-6 iterations It does not work with bfloat16 precision

Chapter 1: The Key Insight

Newton-Schulz applies the same polynomial at every iteration. Jordan and You apply different polynomials at each iteration but find them by heuristic search — trial and error with no optimality guarantee. The authors of this paper asked: what if we could find the mathematically optimal polynomial for each iteration?

Here is the beautiful result: greedy is optimal. At each iteration, you have a current interval of singular values [ℓ, u]. You pick the degree-d odd polynomial that maps this interval as close to 1 as possible. This greedy choice — optimize one step at a time, ignoring the future — turns out to produce the globally optimal composition of polynomials. Not just "pretty good." Provably the best possible.

Why greedy works here (and almost never works elsewhere): In most optimization problems, greedy choices are suboptimal because early decisions constrain later ones. Here, the structure is different. Each polynomial maps the interval [ℓ_t, u_t] to a new interval [ℓ_t+1, u_t+1]. The next polynomial's job depends only on this new interval, not on the history of how we got there. The key theorem (Theorem 3.1) proves that optimizing each step independently gives the same result as optimizing the entire composition jointly. The composed polynomial minimizes worst-case error over all possible compositions of odd polynomials of the same degree.

Furthermore, finding the optimal polynomial for a single step reduces to a classical problem in approximation theory: uniform approximation. This problem was studied by Chebyshev in the 1850s. The solution is well-known and has a beautiful characterization called the Equioscillation Theorem: the optimal approximation is the unique polynomial whose error oscillates equally between its maximum and minimum values.

The intellectual genealogy: This paper sits at the intersection of three fields that rarely talk to each other: (1) Numerical linear algebra (polar decomposition, matrix sign function), (2) Approximation theory (Chebyshev polynomials, Remez algorithm, equioscillation), and (3) Deep learning optimization (Muon, NanoGPT speedruns). The authors probably discovered the connection by noticing that Newton-Schulz's polynomial p(x) = (3x - x³)/2 is the scalar sign iteration — and sign(x) is the function 1 on [ℓ, u] that we are trying to approximate. Once you see it as an approximation theory problem, a century of mathematical machinery becomes available.

The result is Polar Express: an algorithm that precomputes optimal polynomials offline (in float64) and applies them online (in bfloat16). Five iterations of Polar Express consistently outperform all prior methods — Newton-Schulz, Jordan's heuristic, You's 6-step heuristic — at every iteration count, on both synthetic and real gradient matrices.

Offline (once, float64)

For each iteration t = 1, ..., T: solve for the optimal degree-5 polynomial p_t that maps [ℓ_t, u_t] as close to 1 as possible. Store the three coefficients (a, b, c) per iteration.

↓

Online (every training step, bfloat16)

Normalize: X₀ = M / (||M||_F + 10^-2). Then apply each precomputed polynomial: X_t = a·X_t-1 + b·X_t-1(X_t-1^TX_t-1) + c·X_t-1(X_t-1^TX_t-1)². Five iterations of three matrix-multiplies each.

↓

Result

X_T ≈ polar(M) = UV^T. Use as the descent direction for Muon weight update: W_t+1 = W_t - λ·X_T.

Why does a greedy approach — optimizing one polynomial at a time — produce the globally optimal composition?

Because each polynomial's optimal choice depends only on the current interval [ℓ_t, u_t], not on previous choices, so independent optimization equals joint optimization Because greedy approaches always work for polynomial problems Because the authors exhaustively search all possible compositions

Chapter 2: Muon & the Polar Decomposition

Before we can understand why Polar Express matters, we need to understand the machine it powers. Let's trace the data flow of a single Muon update step.

The Muon update rule

Let W_t ∈ ℝ^m×n be the weight matrix of a layer at iteration t. Let G_t ∈ ℝ^m×n be the stochastic gradient. Muon maintains a running momentum estimate M_t:

M_t = β M_t-1 + (1 - β) G_t

where β = 0.9 by default. Then the weight update is:

W_t+1 = W_t - λ · polar(M_t)

Compare this to SGD with momentum, which does W_t+1 = W_t - λ · M_t. The only difference is that Muon replaces M_t with polar(M_t).

What is polar(M)?

Every matrix M has a Singular Value Decomposition (SVD): M = U Σ V^T, where U and V are orthogonal matrices and Σ is diagonal with non-negative entries σ₁ ≥ σ₂ ≥ ... ≥ σ_r > 0 (the singular values).

The polar factor is defined as:

polar(M) := U V^T

In other words: take the SVD, throw away the singular values, keep just the "rotation" part. This gives you the closest semi-orthogonal matrix to M — the matrix that points in the same "direction" as M but has all singular values equal to 1.

Why this helps optimization: Standard SGD steps in the direction -M_t, which is the steepest descent direction with respect to the Frobenius norm (sum of squared entries). But Muon steps in the direction -polar(M_t), which is the steepest descent direction with respect to the spectral norm (largest singular value). The spectral norm treats all singular directions more equally — it does not let a few large singular values dominate the update. Think of it as a more democratic gradient: every direction gets equal say, regardless of magnitude.

Why not just compute the SVD?

In theory, you could compute polar(M) by running torch.linalg.svd(M) and returning U @ V.T. The problem is runtime. For a gradient matrix of shape (1280, 1280) in a GPT-2-Large layer, the SVD takes roughly 2x the time of the forward+backward pass combined. And you need to compute polar(M) for every weight matrix at every training step.

The ablation experiments in this paper (Figure 5) confirm the damage: using the exact SVD in Muon doubles the median training step time from ~2 seconds to ~4 seconds. And the kicker? It doesn't even improve validation loss compared to a 5-iteration polynomial approximation. The extra precision is wasted.

Data flow summary: G_t ∈ ℝ^m×n (gradient, bfloat16) → M_t ∈ ℝ^m×n (momentum, bfloat16) → normalize by Frobenius norm → apply 5 polynomial iterations (each: 3 matrix multiplies in bfloat16) → X₅ ≈ polar(M_t) ∈ ℝ^m×n → weight update. Total cost: 15 matrix-matrix multiplies per layer per step. No SVD. No matrix inverse. No QR decomposition. Just multiplies.

The interactive simulation below lets you see what the polar factor does to a matrix. Drag the slider to change the condition number (ratio of largest to smallest singular value). Watch how polar(M) "normalizes" the singular values to 1 while preserving the rotation.

Polar Decomposition Visualizer

A 2×2 matrix M = UΣV^T transforms the unit circle into an ellipse. polar(M) = UV^T maps the circle to a (rotated) circle. Adjust the condition number to see how different σ₁/σ₂ ratios affect the geometry.

σ₁/σ₂ 4.0

Rotation θ 0.45

What does polar(M) = UV^T do to the singular values of M?

It squares them It replaces all of them with 1, preserving only the rotational part U and V It divides each by the largest singular value

Chapter 3: Newton-Schulz & Its Limits

Now that we know what polar(M) is and why we need to compute it cheaply, let's understand the existing workhorse: the Newton-Schulz iteration.

The scalar sign function

Forget matrices for a moment. Consider a scalar x ∈ (0, 1]. We want to map it to 1 (since polar decomposition replaces each singular value with 1). The cubic Newton-Schulz polynomial is:

p(x) = (3x - x³) / 2

Let's check: p(1) = (3 - 1)/2 = 1. Good, 1 is a fixed point. What about x = 0.5? p(0.5) = (1.5 - 0.125)/2 = 0.6875. Apply again: p(0.6875) ≈ 0.8999. Again: p(0.8999) ≈ 0.9985. It's converging to 1! But what about x = 0.01? p(0.01) ≈ 0.015. p(0.015) ≈ 0.0225. After 5 iterations, we're at about 0.076. Barely moved. This is the problem.

The degree-5 variant

In practice, Muon uses the degree-5 Newton-Schulz:

p(x) = (15x - 10x³ + 3x⁵) / 8

This converges faster (cubically instead of quadratically near x = 1), but still suffers the same slow-start problem for small x. Let's verify with a worked example:

Iteration	x (start = 0.01)	x (start = 0.5)	x (start = 0.99)
0	0.0100	0.5000	0.9900
1	0.0188	0.8984	1.0000
2	0.0352	1.0000	1.0000
3	0.0659	1.0000	1.0000
4	0.1233	1.0000	1.0000
5	0.2293	1.0000	1.0000

After 5 iterations starting from x = 0.01, we're only at 0.23 — still far from 1. The error is 1 - 0.23 = 0.77. But starting from x = 0.5, we converge in just 2 iterations. Newton-Schulz is great when you're already close to 1, and terrible when you're far away.

Why the slow start happens: Newton-Schulz's polynomial is designed to have a super-attractive fixed point at x = 1. Near x = 1, the derivative |p'(1)| = 0, so convergence is faster than exponential. But near x = 0, the derivative is |p'(0)| = 15/8 ≈ 1.875. This means each iteration multiplies x by roughly 1.875 — only 87.5% growth per step. For a singular value that starts at 10^-6, it takes about 16 iterations just to reach x = 0.1. And real gradient matrices in neural networks have singular values spanning 6+ orders of magnitude.

Jordan's heuristic

Keller Jordan realized the problem and proposed a different degree-5 polynomial found by numerical search:

p(x) = 3.4445x - 4.7750x³ + 2.0315x⁵

This polynomial has a much larger derivative near x = 0 (p'(0) = 3.4445), so small singular values grow faster in the early iterations. But there's a catch: this polynomial does not have 1 as a stable fixed point. Instead, it has a fixed point around 0.7. So it converges rapidly to a neighborhood of 1 but never gets closer than about 0.3 error. It plateaus.

You's 6-step method

Cesista, You, and Jordan improved on this by using six different polynomials in sequence, each found by heuristic search. This reaches better accuracy than Jordan's single polynomial, but still fails to converge to zero error. Both heuristic methods are ad hoc — there is no guarantee they are anywhere near optimal.

The fundamental tradeoff Newton-Schulz faces: A degree-5 odd polynomial has three free coefficients (a, b, c for ax + bx³ + cx⁵). Newton-Schulz uses coefficients designed to maximize the order of convergence near x = 1 (super-exponential). Jordan uses coefficients designed to maximize growth near x = 0 (fast initial progress). You can't have both with a single fixed polynomial. The breakthrough is: you don't have to use the same polynomial every time.

The simulation below lets you trace how a single scalar value evolves under each method. Try starting values near 0 and near 1 to see the tradeoff in action.

Scalar Iteration Tracer

Pick a starting value x₀ and watch how each method maps it toward 1. Green = Newton-Schulz, red = Jordan, blue = Polar Express (adaptive).

x₀ 0.05

Iterations 6

Why does Newton-Schulz converge slowly for small singular values but fast for values near 1?

Its polynomial has derivative ≈ 1.875 near zero (barely growing) but derivative = 0 at x = 1 (super-exponential convergence) It only works for singular values greater than 0.5 The bfloat16 precision loses small values

Chapter 4: Approximation Theory

We now arrive at the mathematical heart of the paper. The trick is to reframe the polar decomposition problem as a polynomial approximation problem, and then invoke 170 years of approximation theory to solve it optimally.

From matrix problem to scalar problem

Recall: polar(M) = UV^T, and we are constructing X_T = p(M) = Up(Σ)V^T for some composed polynomial p = p_T ∘ ... ∘ p₁. The spectral norm error is:

||polar(M) - p(M)||₂ = ||U(I - p(Σ))V^T||₂ = max_i |1 - p(σ_i)|

The last step uses the fact that U and V are orthogonal (they don't change the spectral norm). So the matrix problem reduces to a scalar problem: find the polynomial p that makes |1 - p(x)| as small as possible for all x in the interval [ℓ, u] where the singular values live.

Odd polynomials only

There is a constraint: each p_t must be an odd polynomial (only odd-degree monomials: x, x³, x⁵, ...). Why? Because for a rectangular matrix M = UΣV^T, we can compute odd monomials of Σ using the formula:

M^2q+1 = U Σ^2q+1 V^T = M(M^TM)^q

This only involves matrix-matrix products — no SVD needed. But even-degree monomials like M² = UΣ²V^T ≠ (M^TM) in general (for non-square M), so we can't compute them without the SVD. We are therefore restricted to odd polynomials.

The optimization problem

Given degree d (odd), lower bound ℓ, upper bound u on the singular values, and T iterations, we seek:

p* = argmin_{p = p_T ∘ ... ∘ p₁} max_{x ∈ [ℓ, u]} |1 - p(x)|

where each p_t is an odd polynomial of degree at most d. This is a minimax problem: minimize the maximum error.

Compositions are powerful

Why compositions? Because they are computationally cheap. A single polynomial of degree d^T would require Θ(d^T/2) matrix-matrix products to evaluate (using the Paterson-Stockmeyer method). But a composition of T polynomials each of degree d requires only O(Td) products. For d = 5 and T = 5, that's 25 matrix products for a polynomial of degree 5⁵ = 3125, versus √3125 ≈ 56 products for a single polynomial of the same degree. Compositions win.

What the paper proves the search reduces to: By unitary invariance, polar(M) - p(M) = U(I - p(Σ))V^T, and ||·||₂ is unitarily invariant. So the matrix optimization problem over all M with σ(M) ⊂ [ℓ, u] reduces to a scalar optimization: find the composition p = p_T ∘ ... ∘ p₁ of odd polynomials that minimizes max_{x ∈ [ℓ, u]} |1 - p(x)|. This is a problem that Chebyshev, Remez, and Zolotarev studied extensively. We are standing on the shoulders of giants.

Worked example: the degree-3 case

Let's solve the simplest case by hand: d = 3. We have p(x) = ax + bx³, and we want to minimize max_{x ∈ [ℓ, u]} |1 - p(x)|. For d = 3, the odd polynomial has (3+1)/2 = 2 coefficients (a and b), so it's a member of a 2-dimensional space. By the Equioscillation Theorem (which we'll cover in the next chapter), the optimal polynomial has exactly 2 + 1 = 3 equioscillation points in [ℓ, u] where |1 - p(x)| attains its maximum with alternating signs.

For ℓ = 0.1 and u = 1, the optimal degree-3 polynomial turns out to be approximately p(x) ≈ 2.48x - 1.49x³. Let's check:

x	p(x)	1 - p(x)
0.10	0.247	+0.753
0.33	0.765	+0.235
0.65	1.203	-0.203
1.00	0.990	+0.010

Compare to Newton-Schulz degree-3 [p(x) = (3x-x³)/2]: at x = 0.1, p(0.1) = 0.1495, giving error 0.8505. The optimal polynomial's error at x = 0.1 is only 0.753 — already better. The optimal polynomial sacrifices some accuracy near x = 1 to gain much more at x = 0.1.

Polynomial Comparison: Newton-Schulz vs Optimal

Plot of p(x) for Newton-Schulz (green) vs optimal (orange) degree-5 polynomial on [ℓ, 1]. The target is y = 1 (dashed line). Adjust ℓ to see how the optimal polynomial adapts.

Lower bound ℓ 0.03

Why must the polynomials applied to M be odd (only odd-degree monomials)?

Because even polynomials always evaluate to zero Because the Equioscillation Theorem only applies to odd polynomials Because for rectangular matrices, odd monomials M^2q+1 = M(M^TM)^q can be computed with matrix-matrix products, while even monomials require the SVD

Chapter 5: Greedy is Optimal

This chapter covers the paper's central theorem. We will derive why optimizing each polynomial greedily — step by step — produces the globally optimal composition.

Setting up the greedy procedure

Define ℓ₁ = ℓ and u₁ = u (the initial bounds on singular values). At iteration t, we have current bounds [ℓ_t, u_t]. We choose p_t to minimize:

p_t = argmin_{p ∈ P_d^odd} max_{x ∈ [ℓ_t, u_t]} |1 - p(x)|

After applying p_t, the singular values that were in [ℓ_t, u_t] are now in a new interval [ℓ_t+1, u_t+1], where:

ℓ_t+1 = min_{x ∈ [ℓ_t, u_t]} p_t(x) u_t+1 = max_{x ∈ [ℓ_t, u_t]} p_t(x)

The key simplification

Here is a beautiful property of the optimal polynomial (proven as Lemma C.2 in the paper): for the optimal p_t, the minimum is always attained at x = ℓ_t and the maximum error always equals 1 - p_t(ℓ_t). Furthermore, u_t+1 = 2 - ℓ_t+1. So the entire evolution is determined by tracking just one number: the lower bound ℓ_t.

ℓ_t+1 = p_t(ℓ_t), u_t+1 = 2 - ℓ_t+1, error_t = 1 - ℓ_t+1

The error at iteration t equals 1 - ℓ_t+1. So the error shrinks as ℓ_t approaches 1. This is the interval shrinkage that the greedy algorithm exploits: each polynomial pushes the interval [ℓ_t, u_t] closer to the single point {1}.

Theorem 3.1 (Greedy optimality)

The composition p* = p_T ∘ ... ∘ p₁ constructed greedily is optimal among ALL compositions of T odd polynomials of degree d. The worst-case error is:

max_{x ∈ [ℓ, u]} |1 - p*(x)| = 1 - ℓ_T+1

Proof sketch (the key idea): The proof works by induction on T. The base case T = 1 is just the standard Chebyshev/Remez optimality result. For the inductive step: suppose the greedy composition of T-1 polynomials p_T-1 ∘ ... ∘ p₁ maps [ℓ, u] to [ℓ_T, u_T]. Any composition of T polynomials can be written as p_T ∘ q, where q is a composition of T-1 polynomials. The image of [ℓ, u] under q is some interval [a, b] ⊇ [ℓ_T, u_T] (by inductive optimality of the greedy choice). Since [a, b] ⊇ [ℓ_T, u_T], the optimal polynomial on [a, b] can do no better than the optimal polynomial on [ℓ_T, u_T]. So the greedy choice at every step is globally optimal.

Convergence rate: super-exponential

The paper shows (Theorem 3.3) that the error satisfies:

||polar(M) - X_T||₂ ≤ |1 - ℓ²|^{(q+1)^T}

where d = 2q + 1. For d = 5 (q = 2), this is cubic convergence: the exponent triples with each iteration. For d = 3 (q = 1), quadratic convergence. This is the same asymptotic rate as Newton-Schulz — but Polar Express also dominates in the initial iterations, giving the best of both worlds.

Worked example: tracking the interval evolution

Let ℓ = 0.03, u = 1, d = 5. The greedy algorithm produces these intervals:

t	ℓ_t	u_t	Error = 1 - ℓ_t+1
1	0.0300	1.000	0.569
2	0.431	1.569	0.068
3	0.932	1.068	3.0 × 10^-4
4	0.9997	1.0003	2.6 × 10^-11
5	≈1	≈1	≈0

After just 4 iterations, the error is at 10^-11. By contrast, Newton-Schulz starting from ℓ = 0.03 has error ≈ 0.95 after 4 iterations. The early iterations of Polar Express do the heavy lifting.

Interval Evolution Visualizer

Watch how the greedy optimal polynomials progressively squeeze the interval [ℓ_t, u_t] toward the single point {1}. Each panel shows one iteration's polynomial mapping the current interval closer to 1.

Initial ℓ 0.03

Iteration 0

In the greedy algorithm, what determines the new interval bounds [ℓ_t+1, u_t+1] after applying the optimal polynomial p_t?

They are always halved: ℓ_t+1 = ℓ_t/2 ℓ_t+1 = p_t(ℓ_t) and u_t+1 = 2 - ℓ_t+1; the entire evolution is tracked by just the lower bound They are determined by evaluating p_t at 100 random points

Chapter 6: The Polar Express Algorithm

We now have all the pieces. Let's assemble the actual algorithm, including the finite-precision tricks that make it work in bfloat16.

Finding each optimal polynomial (degree 5)

At each iteration, we need to solve: find the odd degree-5 polynomial p(x) = ax + bx³ + cx⁵ that minimizes max_{x ∈ [ℓ, u]} |1 - p(x)|. For degree 5, the paper provides Algorithm 2 — a clean iterative procedure equivalent to the Remez algorithm but specialized to this simple case.

The Equioscillation Theorem says the optimal polynomial makes the error |1 - p(x)| equioscillate — it reaches its maximum with alternating signs at exactly (5+1)/2 + 1 = 4 points in [ℓ, u]. The two interior equioscillation points q and r are determined by the condition that the error values at ℓ, q, r, u alternate in sign and are equal in magnitude.

The algorithm solves a 4×4 linear system at each iteration until q and r converge. Here is the core loop (simplified from Implementation 2):

python
def optimal_quintic(l, u):
    # Find q, r (interior equioscillation points)
    q = (3*l + u) / 4
    r = (l + 3*u) / 4
    while True:
        # Solve 4x4 system: p(l)=1-E, p(q)=1+E, p(r)=1-E, p(u)=1+E
        LHS = np.array([
            [l, l**3, l**5,  1],
            [q, q**3, q**5, -1],
            [r, r**3, r**5,  1],
            [u, u**3, u**5, -1],
        ])
        a, b, c, E = np.linalg.solve(LHS, np.ones(4))
        # Update q, r to the actual equioscillation points
        q, r = np.sqrt((-3*b + np.array([-1, 1]) *
                    sqrt(9*b**2 - 20*a*c)) / (10*c))
        # Repeat until converged
    return a, b, c

Precomputed coefficients for ℓ = 10^-3

When you run this for T = 6 iterations starting from ℓ = 10^-3, u = 1, you get these coefficients (after applying the safety factor of 1.01):

Iter	a	b	c
1	8.287	-23.596	17.300
2	4.107	-2.947	0.545
3	3.949	-2.909	0.552
4	3.318	-2.488	0.510
5	2.301	-1.669	0.419
6+	1.875	-1.250	0.375

Notice how the coefficients evolve. Iteration 1 has large coefficients (a = 8.3) to aggressively boost small singular values. By iteration 5, the coefficients approach Newton-Schulz (15/8, -10/8, 3/8) because once the singular values are near 1, Newton-Schulz's super-exponential convergence is optimal. Polar Express automatically transitions from aggressive early behavior to Newton-Schulz's polished late behavior.

The complete online algorithm (Implementation 1):
(1) Normalize: X₀ = M / (||M||_F + 10^-2), in bfloat16.
(2) If m > n, transpose: X = X^T (reduces FLOPs for tall matrices).
(3) For t = 1, ..., T: compute A = X^TX, B = b·A + c·A², X = a·X + X·B.
(4) If transposed, transpose back.
That's it. Three matrix multiplies per iteration. No branching, no conditionals. Pure GEMM.

Finite-precision considerations

Two issues arise when running in bfloat16:

Issue 1: Round-off can exceed u_t. Numerical errors can create singular values slightly larger than the upper bound u_t. If the polynomial overshoots for these values, the iteration can diverge. Fix: replace each p_t(x) by p_t(x/1.01), effectively widening the interval by 1%. This costs almost nothing in accuracy.

Issue 2: Non-monotonicity in early iterations. The optimal polynomial p₁ has strong oscillations (from the equioscillation property). In the first iteration, these oscillations can cause the lower bound ℓ₂ to be slightly negative in finite precision, which breaks the iteration. Fix: use a slightly suboptimal (but more monotone) polynomial in the first iteration by solving the minimax problem on [max(ℓ_t, u_t/10), u_t] instead of [ℓ_t, u_t]. This cushions the lower bound at a negligible cost.

The design decision that matters: The Frobenius norm normalization X₀ = M/||M||_F ensures u ≈ 1 but gives a loose lower bound ℓ. The authors set ℓ = 10^-3 as a guess for all gradient matrices. If the true σ_min is much smaller, convergence is delayed by a few iterations but never breaks. If σ_min is larger than 10^-3, convergence is slightly slower than if we knew the true ℓ, but still faster than alternatives. Robustness to a bad ℓ is confirmed experimentally (Figure 3): even 2 orders of magnitude off, Polar Express remains competitive.

python — Implementation 1 (complete)
from itertools import repeat
import torch

coeffs_list = [
    (8.28721, -23.59589, 17.30039),
    (4.10706, -2.94748, 0.54484),
    (3.94870, -2.90890, 0.55182),
    (3.31842, -2.48849, 0.51005),
    (2.30065, -1.66890, 0.41888),
    (1.89130, -1.26800, 0.37680),
    (1.87500, -1.25000, 0.37500),  # Newton-Schulz
]
# Safety factor for bfloat16 stability
coeffs_list = [(a/1.01, b/1.01**3, c/1.01**5)
               for (a,b,c) in coeffs_list[:-1]] + [coeffs_list[-1]]

@torch.compile
def PolarExpress(G, steps=5):
    X = G.bfloat16()
    if G.size(-2) > G.size(-1): X = X.mT
    X = X / (X.norm(dim=(-2,-1), keepdim=True) * 1.01 + 1e-7)
    hs = coeffs_list[:steps] + list(
        repeat(coeffs_list[-1], steps - len(coeffs_list)))
    for a, b, c in hs:
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X   # X <- aX + bX^3 + cX^5
    if G.size(-2) > G.size(-1): X = X.mT
    return X

Why does Polar Express use different polynomial coefficients for the first few iterations but converge to Newton-Schulz coefficients for later iterations?

Because early iterations need large coefficients to aggressively boost small singular values, while later iterations (where values are near 1) benefit from Newton-Schulz's super-exponential convergence Because the Remez algorithm fails for intervals close to [1, 1] Because bfloat16 precision cannot handle large coefficients

Chapter 7: Convergence Explorer

This is the showcase chapter. The interactive simulation below reconstructs Figure 3 from the paper — the convergence comparison of all methods — and lets you explore it with full control.

You control the condition number (ratio of largest to smallest singular value), the number of iterations, and the lower bound ℓ used by Polar Express. Watch how each method converges on a log scale. Pay attention to two regimes: the early iterations (where Polar Express dominates due to its adaptive polynomials) and the late iterations (where Polar Express matches or exceeds Newton-Schulz's asymptotic rate).

Convergence Race: All Methods

Spectral-norm error ||polar(M) - X_t||₂ vs iteration, for a matrix with singular values log-spaced between σ_min and 1. All methods use degree-5 polynomials. Toggle methods on/off with checkboxes.

σ_min 10^-6

Polar Express ℓ 10^-6

Max iterations 20

Newton-Schulz Jordan You (6-step) Polar Express

What to look for in the simulation:
1. Set σ_min = 10^-6 (default). Newton-Schulz (green) flatlines until iteration 17. Jordan (red) plateaus at ~0.3. Polar Express (orange) reaches 10^-3 by iteration 11.
2. Set σ_min = 10^-1 (close to 1). Now Newton-Schulz and Polar Express converge at similar rates. The adaptive polynomials automatically approach Newton-Schulz when the interval is already narrow.
3. Try setting Polar Express ℓ = 10^-2 while σ_min = 10^-6. Convergence is delayed by ~3 iterations but eventually catches up. The method is robust to misspecified ℓ.

Breaking the method: Try setting ℓ to something wildly wrong — like 10^-1 when σ_min = 10^-8. Polar Express won't diverge (it can't, since each polynomial is bounded on [0, 2]), but it won't make good progress until the actual singular values enter the interval where its polynomial is well-calibrated. The method degrades gracefully: bad ℓ = slow convergence. Never crash. This is a deliberate engineering choice.

When σ_min is very small (10^-6) and you use only 5 iterations, which methods reach practical accuracy (< 0.1 error)?

All four methods reach error < 0.1 in 5 iterations Only Newton-Schulz, because it has proven convergence None of the fixed-polynomial methods (Newton-Schulz, Jordan) do; only Polar Express with a well-tuned ℓ gets close

Chapter 8: GPT-2 Experiments

Theory is nice; does it actually help train language models? The authors integrate Polar Express into Muon and train GPT-2 models on the FineWeb dataset. The results are remarkably consistent.

Experimental setup

Parameter	GPT-2-Small	GPT-2-Large
n_embd	768	1280
n_layer	12	36
n_head	12	20
Vocab size	50,257
Context length	1024
Training data	FineWeb, 1B tokens (main) / 10B (extended)
Batch size	32
Precision	Mixed (bfloat16)
GPUs	4 × H100
LR schedule	Constant for 40%, then linear decay
Polar iterations	5 (all methods)

All methods use degree-5 polynomials with 5 iterations, so they have exactly the same computational cost. The only difference is which polynomial coefficients they use. This makes the comparison perfectly controlled: any validation loss difference is purely due to the quality of the polar approximation.

GPT-2-Large results (1B tokens)

Method	Best LR	Final Val Loss
AdamW	—	— (baseline)
muon-Jordan	0.02	3.398
muon-You	0.02	3.399
muon-PolarExp	0.02	3.340

Polar Express achieves a validation loss of 3.340, a significant improvement over Jordan (3.398) and You (3.399). The advantage is consistent across all tested learning rates. Not just the best one — every single one.

GPT-2-Small results (1B tokens)

Method	Best LR	Final Val Loss
AdamW	0.0005	4.197
muon-Jordan	0.01	3.639
muon-You	0.01	3.629
muon-PolarExp	0.005	3.588

Why does a better polar approximation improve training? This is the paper's most surprising finding. Using > 6 iterations or even the exact SVD does NOT improve validation loss compared to 5 iterations of Polar Express (Figure 5). So it's not about absolute accuracy. The hypothesis: the first 5 iterations of Polar Express produce a systematically better approximation in the early, high-error regime than Jordan or You. Since every training step uses this approximation, the accumulated advantage over millions of steps is significant. The gradient direction is slightly more correct at every single step.

Scaling to 10B tokens

When extended to 10B tokens (matching the Chinchilla scaling rule for GPT-2-Large), the advantage persists:

Method	Best LR	Final Val Loss (10B tokens)
muon-Jordan	0.002	2.921
muon-You	0.002	2.919
muon-PolarExp	0.002	2.913

The margin narrows at 10B tokens (0.006 vs 0.058 at 1B), but remains consistent and statistically significant. Polar Express is a strict improvement that never hurts.

Ablation: number of iterations

How many iterations do you need? The authors ablated from 2 to 30 iterations plus exact SVD:
- 2-3 iterations: notably worse validation loss.
- 5-6 iterations: optimal. More iterations give no benefit.
- 30 iterations: same loss as 5 iterations. Extra accuracy is wasted.
- Exact SVD: same loss as 5 iterations, but 2× slower per step.
The sweet spot is 5 iterations. The polar approximation only needs to be "good enough" — roughly 0.01 spectral error — and Polar Express reaches this threshold faster than alternatives.

Training Curve Comparison

Simulated validation loss curves for muon-Jordan, muon-You, and muon-PolarExp on GPT-2-Large (1B tokens). Based on the paper's reported results.

Why does using the exact SVD (instead of 5 iterations of Polar Express) NOT improve validation loss?

Because 5 iterations of Polar Express already reach sufficient accuracy (~0.01 error); extra precision is wasted since the stochastic gradient itself is noisy Because the SVD introduces numerical errors in bfloat16 Because Muon does not use the polar decomposition

Chapter 9: Connections

Cheat sheet: key equations

Symbol	Meaning
M = UΣV^T	SVD of momentum matrix M
polar(M) = UV^T	Closest semi-orthogonal matrix to M (replace singular values with 1)
p(x) = ax + bx³ + cx⁵	Degree-5 odd polynomial applied to singular values
[ℓ_t, u_t]	Current interval bounds on singular values at iteration t
p_t = argmin max \|1-p(x)\|	Optimal polynomial for iteration t (Chebyshev/Remez)
ℓ_t+1 = p_t(ℓ_t)	New lower bound = polynomial applied to old lower bound
u_t+1 = 2 - ℓ_t+1	New upper bound determined by equioscillation symmetry
error = 1 - ℓ_T+1	Worst-case approximation error after T iterations

What the paper does NOT solve

Open questions:
1. Adaptive ℓ estimation. The paper uses ℓ = 10^-3 for all matrices. A cheap way to estimate σ_min per-matrix could further improve convergence. Power iteration gives σ_max cheaply; σ_min is harder.
2. Mixed-degree compositions. Theorem 3.1 allows different degrees at each iteration (e.g., degree-7 for iteration 1, degree-3 for later ones). The paper only tests fixed d = 5.
3. Beyond Muon. The polar decomposition appears in Shampoo, SOAP, spectral descent, and Procrustes alignment. Polar Express could accelerate all of these.
4. Scaling to larger models. The experiments go up to GPT-2-Large (774M). Kimi K2 (1T params) uses Muon with Newton-Schulz. Switching to Polar Express there could be impactful.

How to implement it

The complete implementation is 30 lines of Python (Implementation 1 in the paper). It is available at github.com/NoahAmsel/PolarExpress. It requires only PyTorch and is a drop-in replacement for any Newton-Schulz call in a Muon implementation. The coefficients are hardcoded, so there is zero setup cost.

When to use Polar Express: Any time you compute polar(M) or sign(M) using iterative polynomial methods. If you currently use Newton-Schulz with 5-10 iterations, switching to Polar Express's coefficients is a free improvement. No hyperparameter tuning needed. Same computational cost. Strictly better approximation at every iteration.

The paper has been adopted into the NanoGPT speedrun (modded-nanogpt), confirming its practical value in the competitive optimization community.

What makes Polar Express a "free" improvement over Newton-Schulz in existing Muon implementations?

It uses the same operations (matrix-matrix multiplies) with the same computational cost; only the polynomial coefficients change It requires fewer GPU memory It uses a completely different algorithm that is simpler to implement