Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon | Tri Dao %E2%9A%9B%EF%B8%8F"> Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

Muon is becoming the optimizer of choice for training state-of-the-art language models like Kimi K2 Thinking1 and GLM-5.2 Compared to AdamW, Muon needs fewer optimizer steps to reach a given loss, but each step is more expensive. This overhead is due to Muon’s Newton-Schulz orthogonalization procedure, a cubic time matrix operation not present in older optimizers.

Figure 1: AdamW vs. Muon: Wall clock time of optimizer step across Llama model sizes, benchmarked on B300. Muon’s superior optimization quality justifies its more expensive optimizer step. However, as model size scales up, the overhead of computing each Muon step grows rapidly. Traditional optimization methods (SGD, AdamW) perform element-wise operations, such as updating the momentum or rescaling it by the second moment. For a weight matrix of size $n \times m$, performing the optimizer step takes $O(mn)$ time given the gradient matrix as input. In contrast, many modern optimizers (Muon,3 Scion,4 Dion,5 SOAP,6 Shampoo,7 SPlus,8 etc.) use orthogonalization or higher-order preconditioning to compute the update to the weights at each training step. These methods require matrix multiplications that cost $O(mn^2)$ time (assuming $n \leq m$). Therefore, the runtime of each call to the optimizer is far greater than for AdamW. Depending on the training setup (global batch size, cluster size, and parallelism settings), Newton-Schulz accounts for between 2% and 17% of end-to-end wall clock time. While $O(mn^2)$ runtime is an unavoidable cost of these algorithms, there is still significant room for improvement in both FLOPs and wall clock time. As it is typically implemented, the Newton-Schulz routine has several shortcomings: It uses not just one or two, but ten multiplications of $n \times m$ matrices, costing $2mn^2$ FLOPs each. Most weights in popular architectures are rectangular, with $m \gg n$, and those of recent MoE architectures with many fine-grained experts are even more rectangular. Thus, the rectangular matrix multiplications dominate the costs of other operations (like small multiplications of $n \times n$ matrices). Many of the intermediate matrices it computes are symmetric, but no computational advantage is taken of this structure. Half the work used to compute these matrices is redundant. It uses cuBLAS for batched matrix multiplication/addition $\alpha \mathbf A \mathbf B + \beta \mathbf C$, which is not fully optimized for the Hopper GPU architecture. Previous work has sought to improve Newton-Schulz by optimizing its polynomial coefficients or its normalization step. While this can reduce the number of iterations needed for Newton-Schulz to converge, it does not address the shortcomings listed above. Others9 have implemented Newton-Schulz using special-purpose symmetric matrix multiplication routines, but the runtime benefit is limited due to the high number of rectangular and non-symmetric multiplications. While Newton-Schulz and related methods have been studied for decades in the numerical analysis literature, research attention has mostly focused on regimes where high accuracy is required, where algorithms are optimized for CPUs rather than GPUs, or where input matrices are square. In recent years, randomized sketching has been used to design sophisticated algorithms for many computations involving highly rectangular matrices; however (aside from further optimizing the coefficients10) these do not seem to be applicable to Muon. Our Contributions To address these shortcomings, we introduce Gram Newton-Schulz , a reworking of the Newton-Schulz routine that reduces the optimizer time by up to 50% in trillion-parameter MoE models like Kimi K2. Instead of iterating on the rectangular input matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$, Gram Newton-Schulz iterates on the small square symmetric Gram matrix $\mathbf{XX^\top} \in\mathbb{R}^{n \times n}$, reducing the FLOP cost and enabling a greater use of symmetric GEMM kernels. Our contributions are as follows. First, we show how to rewrite standard Newton-Schulz in a form that is mathematically identical, producing the exact same output up to floating-point error, but that acts mostly on the space of $n \times n$ matrices. Because these matrices are smaller and admit specialized symmetric matrix multiplication routines, each iteration is faster than in standard Newton-Schulz. Only the preprocessing step (forming $\mathbf X \mathbf X^\top$) and the post-processing step (multiplying by $\mathbf X$) require rectangular matrix multiplications. We call this new form Naive Gram Newton-Schulz. Second, we conduct a thorough study of the numerical properties of Naive Gram Newton-Schulz. We identify the potential for numerical instability when using half-precision floating point arithmetic, especially due to...

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

It's Not Just X. It's Y

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy