Reading MAI's efficiency gain. How to pick architectures like serious people

Reading MAI's efficiency gain — idlemachines


Microsoft's MAI-Thinking-1 report is mostly about a 35B-active, ~1T-total sparse MoE reasoning model (and the report is so detailed, by the way, that we'll be digging into it over the next few weeks), but the bit I want to talk about today is a small methodological idea tucked into the pre-training section. There is a perennial problem when designing a model: it's always a trade-off between compute budget and the final loss it reaches. You might point to the Chinchilla scaling laws and pick the point on the curve that is optimal for your compute budget, but that only works if you trust the curve to be the same for your new design as it was for the old one. There is also one glaring caveat: how many FLOPs do you actually get from your own training stack and your cluster? And more importantly, if you're paying by the hour, what is the real wall-clock time? Because I bet it doesn't match anyone else's, not quite.

This is where MAI's efficiency gain (EG) comes in. It's a metric that says how much better or worse a candidate design is than the baseline, and it can be computed on any cost axis we like. What we expect in lots of cases is that the optimal model from a FLOPs perspective will not be the optimal model from a wall-clock time perspective.

Two kinds of efficiency

MFU (model FLOPs utilisation) is the ratio of the FLOP/s your training run actually achieves to the theoretical peak FLOP/s of your hardware.

It's worth thinking a bit more about what the two types of efficiency we're talking about here mean. Your GPU (probably Nvidia, but bonus points if you're using something left field) has a headline number that tells you the theoretical peak FLOP/s it can compute, but real models don't get anywhere close. A well-optimised Transformer might sit at around 50-60% MFU, and a more exotic architecture might be down in the 20s. That means that if you design a model that's cheap in FLOPs, but the kernels are bad and the MFU is low, you might end up with a model that looks great on paper but crawls on your real (and very expensive) hardware.
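To make that last point concrete, here is a minimal sketch of how a FLOPs budget and an MFU figure turn into wall-clock time. Every number in it (the peak throughput, the FLOPs budgets, the two MFU values) is hypothetical and chosen only to illustrate the arithmetic; none of it comes from the report.

```python
def wall_clock_seconds(total_flops: float, peak_flops_per_s: float, mfu: float) -> float:
    """Time needed to execute total_flops on hardware running at the given MFU."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    return total_flops / (peak_flops_per_s * mfu)


# Hypothetical numbers, for illustration only.
PEAK = 1.0e15  # assumed aggregate peak FLOP/s of the hardware

# A mature design with well-tuned kernels.
baseline_time = wall_clock_seconds(1.0e21, PEAK, mfu=0.50)
# A design that needs 20% fewer FLOPs but runs on slow kernels.
candidate_time = wall_clock_seconds(0.8e21, PEAK, mfu=0.25)

print(f"baseline:  {baseline_time / 3600:.1f} h")
print(f"candidate: {candidate_time / 3600:.1f} h")
print(f"candidate takes {candidate_time / baseline_time:.2f}x as long")
```

With these made-up figures the candidate saves a fifth of the FLOPs and still takes longer to train, because halving the MFU costs more than the FLOPs saving buys back.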

Counting FLOPs does have one really important virtue: it is independent of the implementation. If you have a new idea, you don't have to wait for the kernels to be optimised to know if it's good. Established, mature architectures have had years of development and optimisation that made their kernels incredibly fast. That makes it hard for a new idea to compete on wall-clock time until its own kernels catch up, but if you look at FLOPs you can see the potential of the idea without worrying about the implementation details.

But as soon as we actually start training the model, whether we're renting time from a cloud or lucky enough to have on-prem hardware, we are always fighting to minimise the wall-clock time. That's either money out of our pocket, other people needing to share the cluster, or just fewer of our own experiments getting run.

These two numbers rarely line up, which is the delightful tension MAI's efficiency gain (EG) lets us reason about. The rough outline is that we fit a curve to a ladder of baseline runs so we know what loss the baseline reaches for any compute budget; then for a candidate that reaches some loss, we ask how much compute the baseline would have needed to reach that same loss, and divide by what the candidate actually spent. An EG above 1 means the candidate got there for less, so it wins; below 1 means it lost. Put FLOPs in for "compute" and we get $\text{EG}_\text{FLOPs}$; put wall-clock time in and we get $\text{EG}_\text{Time}$.
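The outline above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the report's procedure: the functional form of the baseline curve is not given here, so I assume a pure power law, loss = a · compute^(−b), fitted in log-log space. The function names, the ladder values and the candidate's numbers are all hypothetical.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)  # (a, b)


def baseline_compute_for(loss, a, b):
    """Invert the fitted curve: compute the baseline needs to reach `loss`."""
    return (a / loss) ** (1.0 / b)


def efficiency_gain(candidate_loss, candidate_compute, a, b):
    """Baseline compute needed for the candidate's loss, over what the candidate spent."""
    return baseline_compute_for(candidate_loss, a, b) / candidate_compute


# Hypothetical baseline ladder: the same runs, with cost measured on two axes.
flops = np.array([1e18, 1e19, 1e20, 1e21])
hours = np.array([2.0, 20.0, 200.0, 2000.0])  # assumed wall-clock time per rung
loss = 30.0 * flops ** -0.05                  # synthetic losses, exactly power-law

a_f, b_f = fit_power_law(flops, loss)  # baseline curve on the FLOPs axis
a_t, b_t = fit_power_law(hours, loss)  # baseline curve on the time axis

# Hypothetical candidate: reaches the loss the baseline gets at 1.2e20 FLOPs,
# while spending 1e20 FLOPs and 300 hours.
cand_loss = 30.0 * 1.2e20 ** -0.05
eg_flops = efficiency_gain(cand_loss, 1e20, a_f, b_f)
eg_time = efficiency_gain(cand_loss, 300.0, a_t, b_t)

print(f"EG_FLOPs = {eg_flops:.2f}")  # 1.20
print(f"EG_Time  = {eg_time:.2f}")   # 0.80
```

The same candidate wins on one axis and loses on the other: the baseline would have needed 240 hours to reach that loss, and the candidate took 300.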

The disagreement in MAI's numbers

Table 2 of the report compares MAI-Base-1's interleaved layout, high-sparsity 8/512 MoE layers alternating with dense FFN layers, against the more conventional choice of a medium-sparsity MoE in every layer. There are two every-layer candidates, measured on the L12–L30 rungs of the ladder, with EG aggregated across the eval suite using their code-heavy weighting (Eq. 3 in the report):

$$\text{Target} = 0.5\,\text{Code} + 0.175\,\text{STEM} + 0.175\,\text{Math} + 0.1\,\text{General} + 0.05\,\text{Multilingual}$$
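As a quick sanity check on that weighting, here is a small sketch of the weighted sum. The weights are the ones in the equation; the `target` helper and the per-domain values passed to it are hypothetical, and this excerpt doesn't say exactly which per-domain quantity the report plugs in, so treat the inputs as placeholders.

```python
# Weights from the equation above; they sum to 1.
WEIGHTS = {
    "Code": 0.5,
    "STEM": 0.175,
    "Math": 0.175,
    "General": 0.1,
    "Multilingual": 0.05,
}


def target(per_domain: dict[str, float]) -> float:
    """Weighted sum of per-domain values, using the code-heavy weighting."""
    missing = WEIGHTS.keys() - per_domain.keys()
    if missing:
        raise KeyError(f"missing domains: {sorted(missing)}")
    return sum(weight * per_domain[name] for name, weight in WEIGHTS.items())


# Hypothetical per-domain values, for illustration only.
example = {"Code": 1.0, "STEM": 2.0, "Math": 2.0, "General": 3.0, "Multilingual": 4.0}
print(f"{target(example):.3f}")
```

Half of the total weight sits on Code, so a design that helps code and slightly hurts everything else can still come out ahead on this target.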

| Candidate (vs interleaved baseline) | $\text{EG}_\text{FLOPs}$ ↑ | $\text{EG}_\text{Time}$ ↑ |
|---|---|---|
| MoE every layer (8/384) | 0.94 | 0.73 |
| MoE every layer (7+1 shared/384) | 1.03 | 0.82 |

So if we look at FLOPs, the 7+1 shared variant looks strong: $\text{EG}_\text{FLOPs} > 1.0$ means it beats the baseline on the FLOPs axis. But on the time axis both candidates look bad, and while the 7+1 shared variant is better than the plain 8/384 variant, it's still a loss compared to the baseline. A 3% win on FLOPs is not going to make up for an 18% loss on time, so the overall verdict is that the interleaved layout is still...
