Amsel, Persson, Musco & Gower — ICLR 2026 (Honorable Mention)

The Polar Express

Optimal polynomial approximations for the polar decomposition, and why they make the Muon optimizer strictly better at training LLMs.

Prerequisites: Matrix multiplication + SVD intuition + What an optimizer does
10
Chapters
8+
Simulations

Chapter 0: The Problem

You are training a large language model. At every step, you compute a gradient matrix G — this tells you which direction to nudge the weights to reduce loss. Standard SGD just subtracts this gradient (scaled by a learning rate) from the weights. Adam and AdamW are smarter: they rescale each parameter individually by dividing by a running estimate of the gradient's magnitude.

But there is a newer optimizer that has been breaking records. It is called Muon (MomentUm Orthogonalized by Newton-Schulz). Muon does something strange: instead of stepping in the direction of the gradient, it steps in the direction of the gradient's polar factor — the closest semi-orthogonal matrix to the gradient.

Why? Because the polar factor gives the steepest descent direction with respect to the spectral norm rather than the Frobenius norm. This single change has let Muon outperform Adam on the NanoGPT speedrun, train 1 trillion parameter models, and push the Pareto frontier of performance-vs-FLOPs for large language models.

The bottleneck: Computing the polar factor exactly requires a full Singular Value Decomposition (SVD), which costs O(mn·min(m,n)) and is catastrophically slow on GPUs. SVD is inherently sequential — it cannot be efficiently parallelized into the massive matrix-multiply operations that GPUs are built for. In practice, using the exact SVD doubles the wall-clock time of every training step.

So Muon uses an approximation. The standard one is Newton-Schulz iteration: apply a fixed polynomial to the matrix repeatedly, and the singular values converge toward 1 (which is what the polar factor does — it replaces every singular value with 1). Newton-Schulz uses only matrix-matrix multiplications, so it flies on GPUs.

But Newton-Schulz has a crippling weakness: it is glacially slow at the start. If some singular values are much smaller than others (which they are in real gradient matrices), Newton-Schulz makes almost no progress for the first 15+ iterations. And in deep learning, you can only afford 5 or 6 iterations per training step.

Heuristic alternatives exist. Keller Jordan proposed a hand-tuned polynomial that converges fast initially but never reaches high accuracy — it plateaus at an error of roughly 0.3 and stays there forever. Cesista, You, and Jordan proposed a 6-step sequence of different polynomials found by heuristic search, which is better but still fails to converge.

The question this paper answers: Is there a provably optimal choice of polynomials — one that converges as fast as possible at every single iteration, both early (when accuracy matters most for training) and late (when you need convergence guarantees)?

The simulation below shows the core problem. Each method applies a degree-5 polynomial at every iteration. Watch how Newton-Schulz flatlines for the first 15 iterations, while Jordan's method races ahead but then gets stuck. Can we do better than both?

The Convergence Problem

Spectral-norm error of three methods approximating sign(x) on a matrix with singular values between 10-6 and 1. All use degree-5 polynomials. Click "Run" to watch them iterate.

What is the core practical problem with using Newton-Schulz in Muon?

Chapter 1: The Key Insight

Newton-Schulz applies the same polynomial at every iteration. Jordan and You apply different polynomials at each iteration but find them by heuristic search — trial and error with no optimality guarantee. The authors of this paper asked: what if we could find the mathematically optimal polynomial for each iteration?

Here is the beautiful result: greedy is optimal. At each iteration, you have a current interval of singular values [ℓ, u]. You pick the degree-d odd polynomial that maps this interval as close to 1 as possible. This greedy choice — optimize one step at a time, ignoring the future — turns out to produce the globally optimal composition of polynomials. Not just "pretty good." Provably the best possible.

Why greedy works here (and almost never works elsewhere): In most optimization problems, greedy choices are suboptimal because early decisions constrain later ones. Here, the structure is different. Each polynomial maps the interval [ℓt, ut] to a new interval [ℓt+1, ut+1]. The next polynomial's job depends only on this new interval, not on the history of how we got there. The key theorem (Theorem 3.1) proves that optimizing each step independently gives the same result as optimizing the entire composition jointly. The composed polynomial minimizes worst-case error over all possible compositions of odd polynomials of the same degree.

Furthermore, finding the optimal polynomial for a single step reduces to a classical problem in approximation theory: uniform approximation. This problem was studied by Chebyshev in the 1850s. The solution is well-known and has a beautiful characterization called the Equioscillation Theorem: the optimal approximation is the unique polynomial whose error oscillates equally between its maximum and minimum values.

The intellectual genealogy: This paper sits at the intersection of three fields that rarely talk to each other: (1) Numerical linear algebra (polar decomposition, matrix sign function), (2) Approximation theory (Chebyshev polynomials, Remez algorithm, equioscillation), and (3) Deep learning optimization (Muon, NanoGPT speedruns). The authors probably discovered the connection by noticing that Newton-Schulz's polynomial p(x) = (3x - x3)/2 is the scalar sign iteration — and sign(x) is the function 1 on [ℓ, u] that we are trying to approximate. Once you see it as an approximation theory problem, a century of mathematical machinery becomes available.

The result is Polar Express: an algorithm that precomputes optimal polynomials offline (in float64) and applies them online (in bfloat16). Five iterations of Polar Express consistently outperform all prior methods — Newton-Schulz, Jordan's heuristic, You's 6-step heuristic — at every iteration count, on both synthetic and real gradient matrices.

Offline (once, float64)
For each iteration t = 1, ..., T: solve for the optimal degree-5 polynomial pt that maps [ℓt, ut] as close to 1 as possible. Store the three coefficients (a, b, c) per iteration.
Online (every training step, bfloat16)
Normalize: X0 = M / (||M||F + 10-2). Then apply each precomputed polynomial: Xt = a·Xt-1 + b·Xt-1(Xt-1TXt-1) + c·Xt-1(Xt-1TXt-1)2. Five iterations of three matrix-multiplies each.
Result
XT ≈ polar(M) = UVT. Use as the descent direction for Muon weight update: Wt+1 = Wt - λ·XT.
Why does a greedy approach — optimizing one polynomial at a time — produce the globally optimal composition?

Chapter 2: Muon & the Polar Decomposition

Before we can understand why Polar Express matters, we need to understand the machine it powers. Let's trace the data flow of a single Muon update step.

The Muon update rule

Let Wt ∈ ℝm×n be the weight matrix of a layer at iteration t. Let Gt ∈ ℝm×n be the stochastic gradient. Muon maintains a running momentum estimate Mt:

Mt = β Mt-1 + (1 - β) Gt

where β = 0.9 by default. Then the weight update is:

Wt+1 = Wt - λ · polar(Mt)

Compare this to SGD with momentum, which does Wt+1 = Wt - λ · Mt. The only difference is that Muon replaces Mt with polar(Mt).

What is polar(M)?

Every matrix M has a Singular Value Decomposition (SVD): M = U Σ VT, where U and V are orthogonal matrices and Σ is diagonal with non-negative entries σ1 ≥ σ2 ≥ ... ≥ σr > 0 (the singular values).

The polar factor is defined as:

polar(M) := U VT

In other words: take the SVD, throw away the singular values, keep just the "rotation" part. This gives you the closest semi-orthogonal matrix to M — the matrix that points in the same "direction" as M but has all singular values equal to 1.

Why this helps optimization: Standard SGD steps in the direction -Mt, which is the steepest descent direction with respect to the Frobenius norm (sum of squared entries). But Muon steps in the direction -polar(Mt), which is the steepest descent direction with respect to the spectral norm (largest singular value). The spectral norm treats all singular directions more equally — it does not let a few large singular values dominate the update. Think of it as a more democratic gradient: every direction gets equal say, regardless of magnitude.

Why not just compute the SVD?

In theory, you could compute polar(M) by running torch.linalg.svd(M) and returning U @ V.T. The problem is runtime. For a gradient matrix of shape (1280, 1280) in a GPT-2-Large layer, the SVD takes roughly 2x the time of the forward+backward pass combined. And you need to compute polar(M) for every weight matrix at every training step.

The ablation experiments in this paper (Figure 5) confirm the damage: using the exact SVD in Muon doubles the median training step time from ~2 seconds to ~4 seconds. And the kicker? It doesn't even improve validation loss compared to a 5-iteration polynomial approximation. The extra precision is wasted.

Data flow summary: Gt ∈ ℝm×n (gradient, bfloat16) → Mt ∈ ℝm×n (momentum, bfloat16) → normalize by Frobenius norm → apply 5 polynomial iterations (each: 3 matrix multiplies in bfloat16) → X5 ≈ polar(Mt) ∈ ℝm×n → weight update. Total cost: 15 matrix-matrix multiplies per layer per step. No SVD. No matrix inverse. No QR decomposition. Just multiplies.

The interactive simulation below lets you see what the polar factor does to a matrix. Drag the slider to change the condition number (ratio of largest to smallest singular value). Watch how polar(M) "normalizes" the singular values to 1 while preserving the rotation.

Polar Decomposition Visualizer

A 2×2 matrix M = UΣVT transforms the unit circle into an ellipse. polar(M) = UVT maps the circle to a (rotated) circle. Adjust the condition number to see how different σ12 ratios affect the geometry.

σ12 4.0
Rotation θ 0.45
What does polar(M) = UVT do to the singular values of M?

Chapter 3: Newton-Schulz & Its Limits

Now that we know what polar(M) is and why we need to compute it cheaply, let's understand the existing workhorse: the Newton-Schulz iteration.

The scalar sign function

Forget matrices for a moment. Consider a scalar x ∈ (0, 1]. We want to map it to 1 (since polar decomposition replaces each singular value with 1). The cubic Newton-Schulz polynomial is:

p(x) = (3x - x3) / 2

Let's check: p(1) = (3 - 1)/2 = 1. Good, 1 is a fixed point. What about x = 0.5? p(0.5) = (1.5 - 0.125)/2 = 0.6875. Apply again: p(0.6875) ≈ 0.8999. Again: p(0.8999) ≈ 0.9985. It's converging to 1! But what about x = 0.01? p(0.01) ≈ 0.015. p(0.015) ≈ 0.0225. After 5 iterations, we're at about 0.076. Barely moved. This is the problem.

The degree-5 variant

In practice, Muon uses the degree-5 Newton-Schulz:

p(x) = (15x - 10x3 + 3x5) / 8

This converges faster (cubically instead of quadratically near x = 1), but still suffers the same slow-start problem for small x. Let's verify with a worked example:

Iterationx (start = 0.01)x (start = 0.5)x (start = 0.99)
00.01000.50000.9900
10.01880.89841.0000
20.03521.00001.0000
30.06591.00001.0000
40.12331.00001.0000
50.22931.00001.0000

After 5 iterations starting from x = 0.01, we're only at 0.23 — still far from 1. The error is 1 - 0.23 = 0.77. But starting from x = 0.5, we converge in just 2 iterations. Newton-Schulz is great when you're already close to 1, and terrible when you're far away.

Why the slow start happens: Newton-Schulz's polynomial is designed to have a super-attractive fixed point at x = 1. Near x = 1, the derivative |p'(1)| = 0, so convergence is faster than exponential. But near x = 0, the derivative is |p'(0)| = 15/8 ≈ 1.875. This means each iteration multiplies x by roughly 1.875 — only 87.5% growth per step. For a singular value that starts at 10-6, it takes about 16 iterations just to reach x = 0.1. And real gradient matrices in neural networks have singular values spanning 6+ orders of magnitude.

Jordan's heuristic

Keller Jordan realized the problem and proposed a different degree-5 polynomial found by numerical search:

p(x) = 3.4445x - 4.7750x3 + 2.0315x5

This polynomial has a much larger derivative near x = 0 (p'(0) = 3.4445), so small singular values grow faster in the early iterations. But there's a catch: this polynomial does not have 1 as a stable fixed point. Instead, it has a fixed point around 0.7. So it converges rapidly to a neighborhood of 1 but never gets closer than about 0.3 error. It plateaus.

You's 6-step method

Cesista, You, and Jordan improved on this by using six different polynomials in sequence, each found by heuristic search. This reaches better accuracy than Jordan's single polynomial, but still fails to converge to zero error. Both heuristic methods are ad hoc — there is no guarantee they are anywhere near optimal.

The fundamental tradeoff Newton-Schulz faces: A degree-5 odd polynomial has three free coefficients (a, b, c for ax + bx3 + cx5). Newton-Schulz uses coefficients designed to maximize the order of convergence near x = 1 (super-exponential). Jordan uses coefficients designed to maximize growth near x = 0 (fast initial progress). You can't have both with a single fixed polynomial. The breakthrough is: you don't have to use the same polynomial every time.

The simulation below lets you trace how a single scalar value evolves under each method. Try starting values near 0 and near 1 to see the tradeoff in action.

Scalar Iteration Tracer

Pick a starting value x0 and watch how each method maps it toward 1. Green = Newton-Schulz, red = Jordan, blue = Polar Express (adaptive).

x0 0.05
Iterations 6
Why does Newton-Schulz converge slowly for small singular values but fast for values near 1?

Chapter 4: Approximation Theory

We now arrive at the mathematical heart of the paper. The trick is to reframe the polar decomposition problem as a polynomial approximation problem, and then invoke 170 years of approximation theory to solve it optimally.

From matrix problem to scalar problem

Recall: polar(M) = UVT, and we are constructing XT = p(M) = Up(Σ)VT for some composed polynomial p = pT ∘ ... ∘ p1. The spectral norm error is:

||polar(M) - p(M)||2 = ||U(I - p(Σ))VT||2 = maxi |1 - p(σi)|

The last step uses the fact that U and V are orthogonal (they don't change the spectral norm). So the matrix problem reduces to a scalar problem: find the polynomial p that makes |1 - p(x)| as small as possible for all x in the interval [ℓ, u] where the singular values live.

Odd polynomials only

There is a constraint: each pt must be an odd polynomial (only odd-degree monomials: x, x3, x5, ...). Why? Because for a rectangular matrix M = UΣVT, we can compute odd monomials of Σ using the formula:

M2q+1 = U Σ2q+1 VT = M(MTM)q

This only involves matrix-matrix products — no SVD needed. But even-degree monomials like M2 = UΣ2VT ≠ (MTM) in general (for non-square M), so we can't compute them without the SVD. We are therefore restricted to odd polynomials.

The optimization problem

Given degree d (odd), lower bound ℓ, upper bound u on the singular values, and T iterations, we seek:

p* = argminp = pT ∘ ... ∘ p1 maxx ∈ [ℓ, u] |1 - p(x)|

where each pt is an odd polynomial of degree at most d. This is a minimax problem: minimize the maximum error.

Compositions are powerful

Why compositions? Because they are computationally cheap. A single polynomial of degree dT would require Θ(dT/2) matrix-matrix products to evaluate (using the Paterson-Stockmeyer method). But a composition of T polynomials each of degree d requires only O(Td) products. For d = 5 and T = 5, that's 25 matrix products for a polynomial of degree 55 = 3125, versus √3125 ≈ 56 products for a single polynomial of the same degree. Compositions win.

What the paper proves the search reduces to: By unitary invariance, polar(M) - p(M) = U(I - p(Σ))VT, and ||·||2 is unitarily invariant. So the matrix optimization problem over all M with σ(M) ⊂ [ℓ, u] reduces to a scalar optimization: find the composition p = pT ∘ ... ∘ p1 of odd polynomials that minimizes maxx ∈ [ℓ, u] |1 - p(x)|. This is a problem that Chebyshev, Remez, and Zolotarev studied extensively. We are standing on the shoulders of giants.

Worked example: the degree-3 case

Let's solve the simplest case by hand: d = 3. We have p(x) = ax + bx3, and we want to minimize maxx ∈ [ℓ, u] |1 - p(x)|. For d = 3, the odd polynomial has (3+1)/2 = 2 coefficients (a and b), so it's a member of a 2-dimensional space. By the Equioscillation Theorem (which we'll cover in the next chapter), the optimal polynomial has exactly 2 + 1 = 3 equioscillation points in [ℓ, u] where |1 - p(x)| attains its maximum with alternating signs.

For ℓ = 0.1 and u = 1, the optimal degree-3 polynomial turns out to be approximately p(x) ≈ 2.48x - 1.49x3. Let's check:

xp(x)1 - p(x)
0.100.247+0.753
0.330.765+0.235
0.651.203-0.203
1.000.990+0.010

Compare to Newton-Schulz degree-3 [p(x) = (3x-x3)/2]: at x = 0.1, p(0.1) = 0.1495, giving error 0.8505. The optimal polynomial's error at x = 0.1 is only 0.753 — already better. The optimal polynomial sacrifices some accuracy near x = 1 to gain much more at x = 0.1.

Polynomial Comparison: Newton-Schulz vs Optimal

Plot of p(x) for Newton-Schulz (green) vs optimal (orange) degree-5 polynomial on [ℓ, 1]. The target is y = 1 (dashed line). Adjust ℓ to see how the optimal polynomial adapts.

Lower bound ℓ 0.03
Why must the polynomials applied to M be odd (only odd-degree monomials)?

Chapter 5: Greedy is Optimal

This chapter covers the paper's central theorem. We will derive why optimizing each polynomial greedily — step by step — produces the globally optimal composition.

Setting up the greedy procedure

Define ℓ1 = ℓ and u1 = u (the initial bounds on singular values). At iteration t, we have current bounds [ℓt, ut]. We choose pt to minimize:

pt = argminp ∈ Pdodd maxx ∈ [ℓt, ut] |1 - p(x)|

After applying pt, the singular values that were in [ℓt, ut] are now in a new interval [ℓt+1, ut+1], where:

t+1 = minx ∈ [ℓt, ut] pt(x)    ut+1 = maxx ∈ [ℓt, ut] pt(x)

The key simplification

Here is a beautiful property of the optimal polynomial (proven as Lemma C.2 in the paper): for the optimal pt, the minimum is always attained at x = ℓt and the maximum error always equals 1 - pt(ℓt). Furthermore, ut+1 = 2 - ℓt+1. So the entire evolution is determined by tracking just one number: the lower bound ℓt.

t+1 = pt(ℓt),   ut+1 = 2 - ℓt+1,   errort = 1 - ℓt+1

The error at iteration t equals 1 - ℓt+1. So the error shrinks as ℓt approaches 1. This is the interval shrinkage that the greedy algorithm exploits: each polynomial pushes the interval [ℓt, ut] closer to the single point {1}.

Theorem 3.1 (Greedy optimality)

The composition p* = pT ∘ ... ∘ p1 constructed greedily is optimal among ALL compositions of T odd polynomials of degree d. The worst-case error is:

maxx ∈ [ℓ, u] |1 - p*(x)| = 1 - ℓT+1
Proof sketch (the key idea): The proof works by induction on T. The base case T = 1 is just the standard Chebyshev/Remez optimality result. For the inductive step: suppose the greedy composition of T-1 polynomials pT-1 ∘ ... ∘ p1 maps [ℓ, u] to [ℓT, uT]. Any composition of T polynomials can be written as pT ∘ q, where q is a composition of T-1 polynomials. The image of [ℓ, u] under q is some interval [a, b] ⊇ [ℓT, uT] (by inductive optimality of the greedy choice). Since [a, b] ⊇ [ℓT, uT], the optimal polynomial on [a, b] can do no better than the optimal polynomial on [ℓT, uT]. So the greedy choice at every step is globally optimal.

Convergence rate: super-exponential

The paper shows (Theorem 3.3) that the error satisfies:

||polar(M) - XT||2 ≤ |1 - ℓ2|(q+1)T

where d = 2q + 1. For d = 5 (q = 2), this is cubic convergence: the exponent triples with each iteration. For d = 3 (q = 1), quadratic convergence. This is the same asymptotic rate as Newton-Schulz — but Polar Express also dominates in the initial iterations, giving the best of both worlds.

Worked example: tracking the interval evolution

Let ℓ = 0.03, u = 1, d = 5. The greedy algorithm produces these intervals:

ttutError = 1 - ℓt+1
10.03001.0000.569
20.4311.5690.068
30.9321.0683.0 × 10-4
40.99971.00032.6 × 10-11
5≈1≈1≈0

After just 4 iterations, the error is at 10-11. By contrast, Newton-Schulz starting from ℓ = 0.03 has error ≈ 0.95 after 4 iterations. The early iterations of Polar Express do the heavy lifting.

Interval Evolution Visualizer

Watch how the greedy optimal polynomials progressively squeeze the interval [ℓt, ut] toward the single point {1}. Each panel shows one iteration's polynomial mapping the current interval closer to 1.

Initial ℓ 0.03
Iteration 0
In the greedy algorithm, what determines the new interval bounds [ℓt+1, ut+1] after applying the optimal polynomial pt?

Chapter 6: The Polar Express Algorithm

We now have all the pieces. Let's assemble the actual algorithm, including the finite-precision tricks that make it work in bfloat16.

Finding each optimal polynomial (degree 5)

At each iteration, we need to solve: find the odd degree-5 polynomial p(x) = ax + bx3 + cx5 that minimizes maxx ∈ [ℓ, u] |1 - p(x)|. For degree 5, the paper provides Algorithm 2 — a clean iterative procedure equivalent to the Remez algorithm but specialized to this simple case.

The Equioscillation Theorem says the optimal polynomial makes the error |1 - p(x)| equioscillate — it reaches its maximum with alternating signs at exactly (5+1)/2 + 1 = 4 points in [ℓ, u]. The two interior equioscillation points q and r are determined by the condition that the error values at ℓ, q, r, u alternate in sign and are equal in magnitude.

The algorithm solves a 4×4 linear system at each iteration until q and r converge. Here is the core loop (simplified from Implementation 2):

python
def optimal_quintic(l, u):
    # Find q, r (interior equioscillation points)
    q = (3*l + u) / 4
    r = (l + 3*u) / 4
    while True:
        # Solve 4x4 system: p(l)=1-E, p(q)=1+E, p(r)=1-E, p(u)=1+E
        LHS = np.array([
            [l, l**3, l**5,  1],
            [q, q**3, q**5, -1],
            [r, r**3, r**5,  1],
            [u, u**3, u**5, -1],
        ])
        a, b, c, E = np.linalg.solve(LHS, np.ones(4))
        # Update q, r to the actual equioscillation points
        q, r = np.sqrt((-3*b + np.array([-1, 1]) *
                    sqrt(9*b**2 - 20*a*c)) / (10*c))
        # Repeat until converged
    return a, b, c

Precomputed coefficients for ℓ = 10-3

When you run this for T = 6 iterations starting from ℓ = 10-3, u = 1, you get these coefficients (after applying the safety factor of 1.01):

Iterabc
18.287-23.59617.300
24.107-2.9470.545
33.949-2.9090.552
43.318-2.4880.510
52.301-1.6690.419
6+1.875-1.2500.375

Notice how the coefficients evolve. Iteration 1 has large coefficients (a = 8.3) to aggressively boost small singular values. By iteration 5, the coefficients approach Newton-Schulz (15/8, -10/8, 3/8) because once the singular values are near 1, Newton-Schulz's super-exponential convergence is optimal. Polar Express automatically transitions from aggressive early behavior to Newton-Schulz's polished late behavior.

The complete online algorithm (Implementation 1):
(1) Normalize: X0 = M / (||M||F + 10-2), in bfloat16.
(2) If m > n, transpose: X = XT (reduces FLOPs for tall matrices).
(3) For t = 1, ..., T: compute A = XTX, B = b·A + c·A2, X = a·X + X·B.
(4) If transposed, transpose back.
That's it. Three matrix multiplies per iteration. No branching, no conditionals. Pure GEMM.

Finite-precision considerations

Two issues arise when running in bfloat16:

Issue 1: Round-off can exceed ut. Numerical errors can create singular values slightly larger than the upper bound ut. If the polynomial overshoots for these values, the iteration can diverge. Fix: replace each pt(x) by pt(x/1.01), effectively widening the interval by 1%. This costs almost nothing in accuracy.

Issue 2: Non-monotonicity in early iterations. The optimal polynomial p1 has strong oscillations (from the equioscillation property). In the first iteration, these oscillations can cause the lower bound ℓ2 to be slightly negative in finite precision, which breaks the iteration. Fix: use a slightly suboptimal (but more monotone) polynomial in the first iteration by solving the minimax problem on [max(ℓt, ut/10), ut] instead of [ℓt, ut]. This cushions the lower bound at a negligible cost.

The design decision that matters: The Frobenius norm normalization X0 = M/||M||F ensures u ≈ 1 but gives a loose lower bound ℓ. The authors set ℓ = 10-3 as a guess for all gradient matrices. If the true σmin is much smaller, convergence is delayed by a few iterations but never breaks. If σmin is larger than 10-3, convergence is slightly slower than if we knew the true ℓ, but still faster than alternatives. Robustness to a bad ℓ is confirmed experimentally (Figure 3): even 2 orders of magnitude off, Polar Express remains competitive.
python — Implementation 1 (complete)
from itertools import repeat
import torch

coeffs_list = [
    (8.28721, -23.59589, 17.30039),
    (4.10706, -2.94748, 0.54484),
    (3.94870, -2.90890, 0.55182),
    (3.31842, -2.48849, 0.51005),
    (2.30065, -1.66890, 0.41888),
    (1.89130, -1.26800, 0.37680),
    (1.87500, -1.25000, 0.37500),  # Newton-Schulz
]
# Safety factor for bfloat16 stability
coeffs_list = [(a/1.01, b/1.01**3, c/1.01**5)
               for (a,b,c) in coeffs_list[:-1]] + [coeffs_list[-1]]

@torch.compile
def PolarExpress(G, steps=5):
    X = G.bfloat16()
    if G.size(-2) > G.size(-1): X = X.mT
    X = X / (X.norm(dim=(-2,-1), keepdim=True) * 1.01 + 1e-7)
    hs = coeffs_list[:steps] + list(
        repeat(coeffs_list[-1], steps - len(coeffs_list)))
    for a, b, c in hs:
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X   # X <- aX + bX^3 + cX^5
    if G.size(-2) > G.size(-1): X = X.mT
    return X
Why does Polar Express use different polynomial coefficients for the first few iterations but converge to Newton-Schulz coefficients for later iterations?

Chapter 7: Convergence Explorer

This is the showcase chapter. The interactive simulation below reconstructs Figure 3 from the paper — the convergence comparison of all methods — and lets you explore it with full control.

You control the condition number (ratio of largest to smallest singular value), the number of iterations, and the lower bound ℓ used by Polar Express. Watch how each method converges on a log scale. Pay attention to two regimes: the early iterations (where Polar Express dominates due to its adaptive polynomials) and the late iterations (where Polar Express matches or exceeds Newton-Schulz's asymptotic rate).

Convergence Race: All Methods

Spectral-norm error ||polar(M) - Xt||2 vs iteration, for a matrix with singular values log-spaced between σmin and 1. All methods use degree-5 polynomials. Toggle methods on/off with checkboxes.

σmin 10-6
Polar Express ℓ 10-6
Max iterations 20
What to look for in the simulation:
1. Set σmin = 10-6 (default). Newton-Schulz (green) flatlines until iteration 17. Jordan (red) plateaus at ~0.3. Polar Express (orange) reaches 10-3 by iteration 11.
2. Set σmin = 10-1 (close to 1). Now Newton-Schulz and Polar Express converge at similar rates. The adaptive polynomials automatically approach Newton-Schulz when the interval is already narrow.
3. Try setting Polar Express ℓ = 10-2 while σmin = 10-6. Convergence is delayed by ~3 iterations but eventually catches up. The method is robust to misspecified ℓ.
Breaking the method: Try setting ℓ to something wildly wrong — like 10-1 when σmin = 10-8. Polar Express won't diverge (it can't, since each polynomial is bounded on [0, 2]), but it won't make good progress until the actual singular values enter the interval where its polynomial is well-calibrated. The method degrades gracefully: bad ℓ = slow convergence. Never crash. This is a deliberate engineering choice.
When σmin is very small (10-6) and you use only 5 iterations, which methods reach practical accuracy (< 0.1 error)?

Chapter 8: GPT-2 Experiments

Theory is nice; does it actually help train language models? The authors integrate Polar Express into Muon and train GPT-2 models on the FineWeb dataset. The results are remarkably consistent.

Experimental setup

ParameterGPT-2-SmallGPT-2-Large
nembd7681280
nlayer1236
nhead1220
Vocab size50,257
Context length1024
Training dataFineWeb, 1B tokens (main) / 10B (extended)
Batch size32
PrecisionMixed (bfloat16)
GPUs4 × H100
LR scheduleConstant for 40%, then linear decay
Polar iterations5 (all methods)

All methods use degree-5 polynomials with 5 iterations, so they have exactly the same computational cost. The only difference is which polynomial coefficients they use. This makes the comparison perfectly controlled: any validation loss difference is purely due to the quality of the polar approximation.

GPT-2-Large results (1B tokens)

MethodBest LRFinal Val Loss
AdamW— (baseline)
muon-Jordan0.023.398
muon-You0.023.399
muon-PolarExp0.023.340

Polar Express achieves a validation loss of 3.340, a significant improvement over Jordan (3.398) and You (3.399). The advantage is consistent across all tested learning rates. Not just the best one — every single one.

GPT-2-Small results (1B tokens)

MethodBest LRFinal Val Loss
AdamW0.00054.197
muon-Jordan0.013.639
muon-You0.013.629
muon-PolarExp0.0053.588
Why does a better polar approximation improve training? This is the paper's most surprising finding. Using > 6 iterations or even the exact SVD does NOT improve validation loss compared to 5 iterations of Polar Express (Figure 5). So it's not about absolute accuracy. The hypothesis: the first 5 iterations of Polar Express produce a systematically better approximation in the early, high-error regime than Jordan or You. Since every training step uses this approximation, the accumulated advantage over millions of steps is significant. The gradient direction is slightly more correct at every single step.

Scaling to 10B tokens

When extended to 10B tokens (matching the Chinchilla scaling rule for GPT-2-Large), the advantage persists:

MethodBest LRFinal Val Loss (10B tokens)
muon-Jordan0.0022.921
muon-You0.0022.919
muon-PolarExp0.0022.913

The margin narrows at 10B tokens (0.006 vs 0.058 at 1B), but remains consistent and statistically significant. Polar Express is a strict improvement that never hurts.

Ablation: number of iterations

How many iterations do you need? The authors ablated from 2 to 30 iterations plus exact SVD:
- 2-3 iterations: notably worse validation loss.
- 5-6 iterations: optimal. More iterations give no benefit.
- 30 iterations: same loss as 5 iterations. Extra accuracy is wasted.
- Exact SVD: same loss as 5 iterations, but 2× slower per step.
The sweet spot is 5 iterations. The polar approximation only needs to be "good enough" — roughly 0.01 spectral error — and Polar Express reaches this threshold faster than alternatives.
Training Curve Comparison

Simulated validation loss curves for muon-Jordan, muon-You, and muon-PolarExp on GPT-2-Large (1B tokens). Based on the paper's reported results.

Why does using the exact SVD (instead of 5 iterations of Polar Express) NOT improve validation loss?

Chapter 9: Connections

Cheat sheet: key equations

SymbolMeaning
M = UΣVTSVD of momentum matrix M
polar(M) = UVTClosest semi-orthogonal matrix to M (replace singular values with 1)
p(x) = ax + bx3 + cx5Degree-5 odd polynomial applied to singular values
[ℓt, ut]Current interval bounds on singular values at iteration t
pt = argmin max |1-p(x)|Optimal polynomial for iteration t (Chebyshev/Remez)
t+1 = pt(ℓt)New lower bound = polynomial applied to old lower bound
ut+1 = 2 - ℓt+1New upper bound determined by equioscillation symmetry
error = 1 - ℓT+1Worst-case approximation error after T iterations

What the paper does NOT solve

Open questions:
1. Adaptive ℓ estimation. The paper uses ℓ = 10-3 for all matrices. A cheap way to estimate σmin per-matrix could further improve convergence. Power iteration gives σmax cheaply; σmin is harder.
2. Mixed-degree compositions. Theorem 3.1 allows different degrees at each iteration (e.g., degree-7 for iteration 1, degree-3 for later ones). The paper only tests fixed d = 5.
3. Beyond Muon. The polar decomposition appears in Shampoo, SOAP, spectral descent, and Procrustes alignment. Polar Express could accelerate all of these.
4. Scaling to larger models. The experiments go up to GPT-2-Large (774M). Kimi K2 (1T params) uses Muon with Newton-Schulz. Switching to Polar Express there could be impactful.

Related content on Engineermaxxing

Directly related:

Kimi K2 — Uses MuonClip (Muon + QK-Clip) to train a 1T-parameter MoE model. Their MuonClip uses Newton-Schulz for the polar factor; Polar Express would be a drop-in replacement.

Background:

Understanding how Muon uses the spectral norm for steepest descent connects to the broader story of preconditioned optimizers (Shampoo, SOAP, AdaFactor) that exploit matrix structure in gradients.

How to implement it

The complete implementation is 30 lines of Python (Implementation 1 in the paper). It is available at github.com/NoahAmsel/PolarExpress. It requires only PyTorch and is a drop-in replacement for any Newton-Schulz call in a Muon implementation. The coefficients are hardcoded, so there is zero setup cost.

When to use Polar Express: Any time you compute polar(M) or sign(M) using iterative polynomial methods. If you currently use Newton-Schulz with 5-10 iterations, switching to Polar Express's coefficients is a free improvement. No hyperparameter tuning needed. Same computational cost. Strictly better approximation at every iteration.

The paper has been adopted into the NanoGPT speedrun (modded-nanogpt), confirming its practical value in the competitive optimization community.

What makes Polar Express a "free" improvement over Newton-Schulz in existing Muon implementations?