Optimal polynomial approximations for the polar decomposition, and why they make the Muon optimizer strictly better at training LLMs.
You are training a large language model. At every step, you compute a gradient matrix G — this tells you which direction to nudge the weights to reduce loss. Standard SGD just subtracts this gradient (scaled by a learning rate) from the weights. Adam and AdamW are smarter: they rescale each parameter individually by dividing by a running estimate of the gradient's magnitude.
But there is a newer optimizer that has been breaking records. It is called Muon (MomentUm Orthogonalized by Newton-Schulz). Muon does something strange: instead of stepping in the direction of the gradient, it steps in the direction of the gradient's polar factor — the closest semi-orthogonal matrix to the gradient.
Why? Because the polar factor gives the steepest descent direction with respect to the spectral norm rather than the Frobenius norm. This single change has let Muon outperform Adam on the NanoGPT speedrun, train 1 trillion parameter models, and push the Pareto frontier of performance-vs-FLOPs for large language models.
So Muon uses an approximation. The standard one is Newton-Schulz iteration: apply a fixed polynomial to the matrix repeatedly, and the singular values converge toward 1 (which is what the polar factor does — it replaces every singular value with 1). Newton-Schulz uses only matrix-matrix multiplications, so it flies on GPUs.
But Newton-Schulz has a crippling weakness: it is glacially slow at the start. If some singular values are much smaller than others (which they are in real gradient matrices), Newton-Schulz makes almost no progress for the first 15+ iterations. And in deep learning, you can only afford 5 or 6 iterations per training step.
Heuristic alternatives exist. Keller Jordan proposed a hand-tuned polynomial that converges fast initially but never reaches high accuracy — it plateaus at an error of roughly 0.3 and stays there forever. Cesista, You, and Jordan proposed a 6-step sequence of different polynomials found by heuristic search, which is better but still fails to converge.
The simulation below shows the core problem. Each method applies a degree-5 polynomial at every iteration. Watch how Newton-Schulz flatlines for the first 15 iterations, while Jordan's method races ahead but then gets stuck. Can we do better than both?
Spectral-norm error of three methods approximating sign(x) on a matrix with singular values between 10-6 and 1. All use degree-5 polynomials. Click "Run" to watch them iterate.
Newton-Schulz applies the same polynomial at every iteration. Jordan and You apply different polynomials at each iteration but find them by heuristic search — trial and error with no optimality guarantee. The authors of this paper asked: what if we could find the mathematically optimal polynomial for each iteration?
Here is the beautiful result: greedy is optimal. At each iteration, you have a current interval of singular values [ℓ, u]. You pick the degree-d odd polynomial that maps this interval as close to 1 as possible. This greedy choice — optimize one step at a time, ignoring the future — turns out to produce the globally optimal composition of polynomials. Not just "pretty good." Provably the best possible.
Furthermore, finding the optimal polynomial for a single step reduces to a classical problem in approximation theory: uniform approximation. This problem was studied by Chebyshev in the 1850s. The solution is well-known and has a beautiful characterization called the Equioscillation Theorem: the optimal approximation is the unique polynomial whose error oscillates equally between its maximum and minimum values.
The result is Polar Express: an algorithm that precomputes optimal polynomials offline (in float64) and applies them online (in bfloat16). Five iterations of Polar Express consistently outperform all prior methods — Newton-Schulz, Jordan's heuristic, You's 6-step heuristic — at every iteration count, on both synthetic and real gradient matrices.
Before we can understand why Polar Express matters, we need to understand the machine it powers. Let's trace the data flow of a single Muon update step.
Let Wt ∈ ℝm×n be the weight matrix of a layer at iteration t. Let Gt ∈ ℝm×n be the stochastic gradient. Muon maintains a running momentum estimate Mt:
where β = 0.9 by default. Then the weight update is:
Compare this to SGD with momentum, which does Wt+1 = Wt - λ · Mt. The only difference is that Muon replaces Mt with polar(Mt).
Every matrix M has a Singular Value Decomposition (SVD): M = U Σ VT, where U and V are orthogonal matrices and Σ is diagonal with non-negative entries σ1 ≥ σ2 ≥ ... ≥ σr > 0 (the singular values).
The polar factor is defined as:
In other words: take the SVD, throw away the singular values, keep just the "rotation" part. This gives you the closest semi-orthogonal matrix to M — the matrix that points in the same "direction" as M but has all singular values equal to 1.
In theory, you could compute polar(M) by running torch.linalg.svd(M) and returning U @ V.T. The problem is runtime. For a gradient matrix of shape (1280, 1280) in a GPT-2-Large layer, the SVD takes roughly 2x the time of the forward+backward pass combined. And you need to compute polar(M) for every weight matrix at every training step.
The ablation experiments in this paper (Figure 5) confirm the damage: using the exact SVD in Muon doubles the median training step time from ~2 seconds to ~4 seconds. And the kicker? It doesn't even improve validation loss compared to a 5-iteration polynomial approximation. The extra precision is wasted.
The interactive simulation below lets you see what the polar factor does to a matrix. Drag the slider to change the condition number (ratio of largest to smallest singular value). Watch how polar(M) "normalizes" the singular values to 1 while preserving the rotation.
A 2×2 matrix M = UΣVT transforms the unit circle into an ellipse. polar(M) = UVT maps the circle to a (rotated) circle. Adjust the condition number to see how different σ1/σ2 ratios affect the geometry.
Now that we know what polar(M) is and why we need to compute it cheaply, let's understand the existing workhorse: the Newton-Schulz iteration.
Forget matrices for a moment. Consider a scalar x ∈ (0, 1]. We want to map it to 1 (since polar decomposition replaces each singular value with 1). The cubic Newton-Schulz polynomial is:
Let's check: p(1) = (3 - 1)/2 = 1. Good, 1 is a fixed point. What about x = 0.5? p(0.5) = (1.5 - 0.125)/2 = 0.6875. Apply again: p(0.6875) ≈ 0.8999. Again: p(0.8999) ≈ 0.9985. It's converging to 1! But what about x = 0.01? p(0.01) ≈ 0.015. p(0.015) ≈ 0.0225. After 5 iterations, we're at about 0.076. Barely moved. This is the problem.
In practice, Muon uses the degree-5 Newton-Schulz:
This converges faster (cubically instead of quadratically near x = 1), but still suffers the same slow-start problem for small x. Let's verify with a worked example:
| Iteration | x (start = 0.01) | x (start = 0.5) | x (start = 0.99) |
|---|---|---|---|
| 0 | 0.0100 | 0.5000 | 0.9900 |
| 1 | 0.0188 | 0.8984 | 1.0000 |
| 2 | 0.0352 | 1.0000 | 1.0000 |
| 3 | 0.0659 | 1.0000 | 1.0000 |
| 4 | 0.1233 | 1.0000 | 1.0000 |
| 5 | 0.2293 | 1.0000 | 1.0000 |
After 5 iterations starting from x = 0.01, we're only at 0.23 — still far from 1. The error is 1 - 0.23 = 0.77. But starting from x = 0.5, we converge in just 2 iterations. Newton-Schulz is great when you're already close to 1, and terrible when you're far away.
Keller Jordan realized the problem and proposed a different degree-5 polynomial found by numerical search:
This polynomial has a much larger derivative near x = 0 (p'(0) = 3.4445), so small singular values grow faster in the early iterations. But there's a catch: this polynomial does not have 1 as a stable fixed point. Instead, it has a fixed point around 0.7. So it converges rapidly to a neighborhood of 1 but never gets closer than about 0.3 error. It plateaus.
Cesista, You, and Jordan improved on this by using six different polynomials in sequence, each found by heuristic search. This reaches better accuracy than Jordan's single polynomial, but still fails to converge to zero error. Both heuristic methods are ad hoc — there is no guarantee they are anywhere near optimal.
The simulation below lets you trace how a single scalar value evolves under each method. Try starting values near 0 and near 1 to see the tradeoff in action.
Pick a starting value x0 and watch how each method maps it toward 1. Green = Newton-Schulz, red = Jordan, blue = Polar Express (adaptive).
We now arrive at the mathematical heart of the paper. The trick is to reframe the polar decomposition problem as a polynomial approximation problem, and then invoke 170 years of approximation theory to solve it optimally.
Recall: polar(M) = UVT, and we are constructing XT = p(M) = Up(Σ)VT for some composed polynomial p = pT ∘ ... ∘ p1. The spectral norm error is:
The last step uses the fact that U and V are orthogonal (they don't change the spectral norm). So the matrix problem reduces to a scalar problem: find the polynomial p that makes |1 - p(x)| as small as possible for all x in the interval [ℓ, u] where the singular values live.
There is a constraint: each pt must be an odd polynomial (only odd-degree monomials: x, x3, x5, ...). Why? Because for a rectangular matrix M = UΣVT, we can compute odd monomials of Σ using the formula:
This only involves matrix-matrix products — no SVD needed. But even-degree monomials like M2 = UΣ2VT ≠ (MTM) in general (for non-square M), so we can't compute them without the SVD. We are therefore restricted to odd polynomials.
Given degree d (odd), lower bound ℓ, upper bound u on the singular values, and T iterations, we seek:
where each pt is an odd polynomial of degree at most d. This is a minimax problem: minimize the maximum error.
Why compositions? Because they are computationally cheap. A single polynomial of degree dT would require Θ(dT/2) matrix-matrix products to evaluate (using the Paterson-Stockmeyer method). But a composition of T polynomials each of degree d requires only O(Td) products. For d = 5 and T = 5, that's 25 matrix products for a polynomial of degree 55 = 3125, versus √3125 ≈ 56 products for a single polynomial of the same degree. Compositions win.
Let's solve the simplest case by hand: d = 3. We have p(x) = ax + bx3, and we want to minimize maxx ∈ [ℓ, u] |1 - p(x)|. For d = 3, the odd polynomial has (3+1)/2 = 2 coefficients (a and b), so it's a member of a 2-dimensional space. By the Equioscillation Theorem (which we'll cover in the next chapter), the optimal polynomial has exactly 2 + 1 = 3 equioscillation points in [ℓ, u] where |1 - p(x)| attains its maximum with alternating signs.
For ℓ = 0.1 and u = 1, the optimal degree-3 polynomial turns out to be approximately p(x) ≈ 2.48x - 1.49x3. Let's check:
| x | p(x) | 1 - p(x) |
|---|---|---|
| 0.10 | 0.247 | +0.753 |
| 0.33 | 0.765 | +0.235 |
| 0.65 | 1.203 | -0.203 |
| 1.00 | 0.990 | +0.010 |
Compare to Newton-Schulz degree-3 [p(x) = (3x-x3)/2]: at x = 0.1, p(0.1) = 0.1495, giving error 0.8505. The optimal polynomial's error at x = 0.1 is only 0.753 — already better. The optimal polynomial sacrifices some accuracy near x = 1 to gain much more at x = 0.1.
Plot of p(x) for Newton-Schulz (green) vs optimal (orange) degree-5 polynomial on [ℓ, 1]. The target is y = 1 (dashed line). Adjust ℓ to see how the optimal polynomial adapts.
This chapter covers the paper's central theorem. We will derive why optimizing each polynomial greedily — step by step — produces the globally optimal composition.
Define ℓ1 = ℓ and u1 = u (the initial bounds on singular values). At iteration t, we have current bounds [ℓt, ut]. We choose pt to minimize:
After applying pt, the singular values that were in [ℓt, ut] are now in a new interval [ℓt+1, ut+1], where:
Here is a beautiful property of the optimal polynomial (proven as Lemma C.2 in the paper): for the optimal pt, the minimum is always attained at x = ℓt and the maximum error always equals 1 - pt(ℓt). Furthermore, ut+1 = 2 - ℓt+1. So the entire evolution is determined by tracking just one number: the lower bound ℓt.
The error at iteration t equals 1 - ℓt+1. So the error shrinks as ℓt approaches 1. This is the interval shrinkage that the greedy algorithm exploits: each polynomial pushes the interval [ℓt, ut] closer to the single point {1}.
The composition p* = pT ∘ ... ∘ p1 constructed greedily is optimal among ALL compositions of T odd polynomials of degree d. The worst-case error is:
The paper shows (Theorem 3.3) that the error satisfies:
where d = 2q + 1. For d = 5 (q = 2), this is cubic convergence: the exponent triples with each iteration. For d = 3 (q = 1), quadratic convergence. This is the same asymptotic rate as Newton-Schulz — but Polar Express also dominates in the initial iterations, giving the best of both worlds.
Let ℓ = 0.03, u = 1, d = 5. The greedy algorithm produces these intervals:
| t | ℓt | ut | Error = 1 - ℓt+1 |
|---|---|---|---|
| 1 | 0.0300 | 1.000 | 0.569 |
| 2 | 0.431 | 1.569 | 0.068 |
| 3 | 0.932 | 1.068 | 3.0 × 10-4 |
| 4 | 0.9997 | 1.0003 | 2.6 × 10-11 |
| 5 | ≈1 | ≈1 | ≈0 |
After just 4 iterations, the error is at 10-11. By contrast, Newton-Schulz starting from ℓ = 0.03 has error ≈ 0.95 after 4 iterations. The early iterations of Polar Express do the heavy lifting.
Watch how the greedy optimal polynomials progressively squeeze the interval [ℓt, ut] toward the single point {1}. Each panel shows one iteration's polynomial mapping the current interval closer to 1.
We now have all the pieces. Let's assemble the actual algorithm, including the finite-precision tricks that make it work in bfloat16.
At each iteration, we need to solve: find the odd degree-5 polynomial p(x) = ax + bx3 + cx5 that minimizes maxx ∈ [ℓ, u] |1 - p(x)|. For degree 5, the paper provides Algorithm 2 — a clean iterative procedure equivalent to the Remez algorithm but specialized to this simple case.
The Equioscillation Theorem says the optimal polynomial makes the error |1 - p(x)| equioscillate — it reaches its maximum with alternating signs at exactly (5+1)/2 + 1 = 4 points in [ℓ, u]. The two interior equioscillation points q and r are determined by the condition that the error values at ℓ, q, r, u alternate in sign and are equal in magnitude.
The algorithm solves a 4×4 linear system at each iteration until q and r converge. Here is the core loop (simplified from Implementation 2):
python def optimal_quintic(l, u): # Find q, r (interior equioscillation points) q = (3*l + u) / 4 r = (l + 3*u) / 4 while True: # Solve 4x4 system: p(l)=1-E, p(q)=1+E, p(r)=1-E, p(u)=1+E LHS = np.array([ [l, l**3, l**5, 1], [q, q**3, q**5, -1], [r, r**3, r**5, 1], [u, u**3, u**5, -1], ]) a, b, c, E = np.linalg.solve(LHS, np.ones(4)) # Update q, r to the actual equioscillation points q, r = np.sqrt((-3*b + np.array([-1, 1]) * sqrt(9*b**2 - 20*a*c)) / (10*c)) # Repeat until converged return a, b, c
When you run this for T = 6 iterations starting from ℓ = 10-3, u = 1, you get these coefficients (after applying the safety factor of 1.01):
| Iter | a | b | c |
|---|---|---|---|
| 1 | 8.287 | -23.596 | 17.300 |
| 2 | 4.107 | -2.947 | 0.545 |
| 3 | 3.949 | -2.909 | 0.552 |
| 4 | 3.318 | -2.488 | 0.510 |
| 5 | 2.301 | -1.669 | 0.419 |
| 6+ | 1.875 | -1.250 | 0.375 |
Notice how the coefficients evolve. Iteration 1 has large coefficients (a = 8.3) to aggressively boost small singular values. By iteration 5, the coefficients approach Newton-Schulz (15/8, -10/8, 3/8) because once the singular values are near 1, Newton-Schulz's super-exponential convergence is optimal. Polar Express automatically transitions from aggressive early behavior to Newton-Schulz's polished late behavior.
Two issues arise when running in bfloat16:
Issue 1: Round-off can exceed ut. Numerical errors can create singular values slightly larger than the upper bound ut. If the polynomial overshoots for these values, the iteration can diverge. Fix: replace each pt(x) by pt(x/1.01), effectively widening the interval by 1%. This costs almost nothing in accuracy.
Issue 2: Non-monotonicity in early iterations. The optimal polynomial p1 has strong oscillations (from the equioscillation property). In the first iteration, these oscillations can cause the lower bound ℓ2 to be slightly negative in finite precision, which breaks the iteration. Fix: use a slightly suboptimal (but more monotone) polynomial in the first iteration by solving the minimax problem on [max(ℓt, ut/10), ut] instead of [ℓt, ut]. This cushions the lower bound at a negligible cost.
python — Implementation 1 (complete) from itertools import repeat import torch coeffs_list = [ (8.28721, -23.59589, 17.30039), (4.10706, -2.94748, 0.54484), (3.94870, -2.90890, 0.55182), (3.31842, -2.48849, 0.51005), (2.30065, -1.66890, 0.41888), (1.89130, -1.26800, 0.37680), (1.87500, -1.25000, 0.37500), # Newton-Schulz ] # Safety factor for bfloat16 stability coeffs_list = [(a/1.01, b/1.01**3, c/1.01**5) for (a,b,c) in coeffs_list[:-1]] + [coeffs_list[-1]] @torch.compile def PolarExpress(G, steps=5): X = G.bfloat16() if G.size(-2) > G.size(-1): X = X.mT X = X / (X.norm(dim=(-2,-1), keepdim=True) * 1.01 + 1e-7) hs = coeffs_list[:steps] + list( repeat(coeffs_list[-1], steps - len(coeffs_list))) for a, b, c in hs: A = X @ X.mT B = b * A + c * A @ A X = a * X + B @ X # X <- aX + bX^3 + cX^5 if G.size(-2) > G.size(-1): X = X.mT return X
This is the showcase chapter. The interactive simulation below reconstructs Figure 3 from the paper — the convergence comparison of all methods — and lets you explore it with full control.
You control the condition number (ratio of largest to smallest singular value), the number of iterations, and the lower bound ℓ used by Polar Express. Watch how each method converges on a log scale. Pay attention to two regimes: the early iterations (where Polar Express dominates due to its adaptive polynomials) and the late iterations (where Polar Express matches or exceeds Newton-Schulz's asymptotic rate).
Spectral-norm error ||polar(M) - Xt||2 vs iteration, for a matrix with singular values log-spaced between σmin and 1. All methods use degree-5 polynomials. Toggle methods on/off with checkboxes.
Theory is nice; does it actually help train language models? The authors integrate Polar Express into Muon and train GPT-2 models on the FineWeb dataset. The results are remarkably consistent.
| Parameter | GPT-2-Small | GPT-2-Large |
|---|---|---|
| nembd | 768 | 1280 |
| nlayer | 12 | 36 |
| nhead | 12 | 20 |
| Vocab size | 50,257 | |
| Context length | 1024 | |
| Training data | FineWeb, 1B tokens (main) / 10B (extended) | |
| Batch size | 32 | |
| Precision | Mixed (bfloat16) | |
| GPUs | 4 × H100 | |
| LR schedule | Constant for 40%, then linear decay | |
| Polar iterations | 5 (all methods) | |
All methods use degree-5 polynomials with 5 iterations, so they have exactly the same computational cost. The only difference is which polynomial coefficients they use. This makes the comparison perfectly controlled: any validation loss difference is purely due to the quality of the polar approximation.
| Method | Best LR | Final Val Loss |
|---|---|---|
| AdamW | — | — (baseline) |
| muon-Jordan | 0.02 | 3.398 |
| muon-You | 0.02 | 3.399 |
| muon-PolarExp | 0.02 | 3.340 |
Polar Express achieves a validation loss of 3.340, a significant improvement over Jordan (3.398) and You (3.399). The advantage is consistent across all tested learning rates. Not just the best one — every single one.
| Method | Best LR | Final Val Loss |
|---|---|---|
| AdamW | 0.0005 | 4.197 |
| muon-Jordan | 0.01 | 3.639 |
| muon-You | 0.01 | 3.629 |
| muon-PolarExp | 0.005 | 3.588 |
When extended to 10B tokens (matching the Chinchilla scaling rule for GPT-2-Large), the advantage persists:
| Method | Best LR | Final Val Loss (10B tokens) |
|---|---|---|
| muon-Jordan | 0.002 | 2.921 |
| muon-You | 0.002 | 2.919 |
| muon-PolarExp | 0.002 | 2.913 |
The margin narrows at 10B tokens (0.006 vs 0.058 at 1B), but remains consistent and statistically significant. Polar Express is a strict improvement that never hurts.
Simulated validation loss curves for muon-Jordan, muon-You, and muon-PolarExp on GPT-2-Large (1B tokens). Based on the paper's reported results.
| Symbol | Meaning |
|---|---|
| M = UΣVT | SVD of momentum matrix M |
| polar(M) = UVT | Closest semi-orthogonal matrix to M (replace singular values with 1) |
| p(x) = ax + bx3 + cx5 | Degree-5 odd polynomial applied to singular values |
| [ℓt, ut] | Current interval bounds on singular values at iteration t |
| pt = argmin max |1-p(x)| | Optimal polynomial for iteration t (Chebyshev/Remez) |
| ℓt+1 = pt(ℓt) | New lower bound = polynomial applied to old lower bound |
| ut+1 = 2 - ℓt+1 | New upper bound determined by equioscillation symmetry |
| error = 1 - ℓT+1 | Worst-case approximation error after T iterations |
Directly related:
Kimi K2 — Uses MuonClip (Muon + QK-Clip) to train a 1T-parameter MoE model. Their MuonClip uses Newton-Schulz for the polar factor; Polar Express would be a drop-in replacement.
Background:
Understanding how Muon uses the spectral norm for steepest descent connects to the broader story of preconditioned optimizers (Shampoo, SOAP, AdaFactor) that exploit matrix structure in gradients.
The complete implementation is 30 lines of Python (Implementation 1 in the paper). It is available at github.com/NoahAmsel/PolarExpress. It requires only PyTorch and is a drop-in replacement for any Newton-Schulz call in a Muon implementation. The coefficients are hardcoded, so there is zero setup cost.
The paper has been adopted into the NanoGPT speedrun (modded-nanogpt), confirming its practical value in the competitive optimization community.