Expert-aware quantisation: near-Q4 quality at near-Q2 size?

Martin Alderson

While researching and writing my last article on the history of KV cache compression, it occurred to me that while a lot of research on KV cache efficiency has actually been implemented, quantisation of the model weights themselves is still pretty blunt.

This makes sense - at large scale with many tens of thousands of GPUs the weights themselves aren't a huge efficiency bottleneck for the most part, and KV cache starts dominating memory usage.

But for us lowly serfs who don't have access to a warehouse full of HBM, it is a problem. The key constraint for local models is (mostly) just loading the weights into something fast enough.

Profiling

I spend a lot of time profiling applications to improve their performance, and a couple of months ago I built a tool to do the same for MoE models.

This got me thinking. What if instead of just quantising the entire model to a certain level - the blunt hammer I mentioned - we instead profile the model first and then quantise the "cold" experts selectively, for that specific set of tasks?

For this research I profiled Qwen3.6 35B-A3B on C++ programming tasks. There's an important nuance worth flagging up front: this particular model is very well load-balanced, so when it's reading code it spreads the work across its experts almost evenly (a per-layer Gini coefficient near 0 - basically uniform). The selectivity only shows up when it's generating code.

And there it concentrates hard. When I ran a handful of C++ generation prompts through it, the per-layer Gini coefficient jumped to 0.61, and the top 32 of the 256 experts handled ~50% of the routing during code generation, versus the 12.5% (32/256) you'd expect if routing were uniform. That concentration is exactly what we can exploit: if only a subset of experts really matter for the task, we can keep those at high precision and crush the rest.
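To make those two concentration measures concrete, here's a minimal sketch of how a per-layer Gini coefficient and a top-k routing share could be computed from per-expert routing counts. The function names and the toy counts are made up for illustration - they aren't taken from the profiling tool or from the real traces.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 = perfectly even, near 1 = concentrated."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over ascending-sorted values with 1-based ranks.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def top_k_share(counts, k):
    """Fraction of all routing decisions handled by the k most-used experts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(sorted(counts, reverse=True)[:k]) / total


# Hypothetical counts for one layer with 256 experts (not the article's data).
uniform = [100] * 256                # every expert used equally
skewed = [1000] * 32 + [10] * 224    # 32 experts do most of the work

print(gini(uniform), top_k_share(uniform, 32))
print(gini(skewed), top_k_share(skewed, 32))
```

With perfectly even routing the Gini coefficient is 0 and the top 32 experts take exactly 32/256 = 12.5% of the traffic, which is the baseline the ~50% figure is being compared against.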

Once we've got these traces showing which experts are hot (used a lot for the specific domain) vs cold (rarely used), we can then move on to the next step.

Model neurosurgery

This took (Claude Code) a fair while - ironically I suspect Fable would have been perfect for this kind of task.

The core idea was to allow llama.cpp to read different levels of quantisation per expert, which had a fair few issues. Eventually though, it figured it out (running autonomously for a good 90 minutes!).

It also wrote a script to take the profiling data and do quantisation per expert.
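The actual script isn't shown here, but the selection step it has to perform can be sketched roughly as follows, assuming the profile boils down to a routing count per expert per layer. The function name, the data layout and the "Q8"/"Q2" labels are placeholders for illustration, not llama.cpp type names or the real script's interface.

```python
def assign_tiers(layer_counts, hot_k=64, hot_type="Q8", cold_type="Q2"):
    """Map each expert in each layer to a quantisation label based on routing counts.

    layer_counts: dict of layer index -> list of per-expert routing counts.
    Returns: dict of layer index -> list of labels, one per expert.
    """
    plan = {}
    for layer, counts in layer_counts.items():
        # Rank experts by usage, most-used first; ties broken by expert index.
        ranked = sorted(range(len(counts)), key=lambda e: (-counts[e], e))
        hot = set(ranked[:hot_k])
        plan[layer] = [hot_type if e in hot else cold_type for e in range(len(counts))]
    return plan


# Toy example: one layer, 8 experts, keep the 2 most-used at high precision.
toy_profile = {0: [5, 1, 9, 0, 3, 9, 2, 1]}
print(assign_tiers(toy_profile, hot_k=2))
```

Ranking per layer rather than globally is an assumption on my part - it keeps the number of hot experts the same in every layer, which makes the resulting file size predictable.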

Results

All numbers below are perplexity (lower is better) measured on CPU. "Reading code" is a held-out chunk of real C++ source; "writing code" is a set of the model's own C++ generations. The tiered models keep a "hot" set of 64 experts (out of 256) at high precision and drop the other 192 "cold" experts to 2-bit.

| Model | Size | PPL - reading | PPL - writing |
|---|---|---|---|
| Q8 (baseline) | 35 GB | 1.568 | n/a * |
| uniform Q4 | 20 GB | 1.582 | 1.449 |
| Q8-hot / Q2-cold (profiled) | 18 GB | 1.620 | 1.456 |
| Q8-hot / Q2-cold (random) | 18 GB | 1.667 | 1.492 |
| Q4-hot / Q2-cold (profiled) | 14 GB | 1.635 | 1.477 |
| Q4-hot / Q2-cold (random) | 14 GB | 1.684 | 1.511 |
| uniform Q2 | 13 GB | 2.103 | 1.977 |

* The "writing code" eval was generated by the Q8 model, so scoring Q8 against it would be circular - it's left out.

A few things jump out.

First, the baseline. Full-fat Q8 (35 GB) scores 1.568 reading C++, and a "blunt" Q2 quantisation of everything (13 GB) jumps to 2.103 - a big drop in quality for less than half the size. (Perplexity is roughly "how surprised the model is at each token" - so going from 1.57 to 2.10 is the model getting noticeably dumber, not lobotomised, but clearly worse.)
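For anyone who wants the "how surprised" intuition pinned down: perplexity is the exponential of the average negative log-probability the model assigned to each token that actually came next. A minimal sketch, assuming you already have per-token natural-log probabilities from whatever eval harness you use (the function name is mine):

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical example: the model gives every correct token probability 0.5,
# so the perplexity comes out at about 2 (a coin flip per token).
print(perplexity([math.log(0.5)] * 4))
```

Read the other way round, 1/perplexity is the geometric-mean probability given to the correct token: roughly 0.64 for Q8 at 1.568, versus roughly 0.48 for uniform Q2 at 2.103.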

Now the actual experiment. I A/B tested the tiered approach two ways: random - pick the hot experts arbitrarily, as a control - versus profiled - keep the experts our profiling flagged as hot for C++ and crush the cold ones. The profiled version wins every single time: across two precision tiers and both eval sets, that's four out of four. With Q8 hot / Q2 cold (18 GB), random tiering scores 1.667 while the profiled version recovers nearly half of that gap back towards Q8, landing at 1.620. So the core idea works - which experts you protect matters, and the profile tells you which ones.
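The "four out of four" and "nearly half" figures can be re-derived from the results table with a few lines of arithmetic. This is just a check on the published numbers; the dictionary layout is mine.

```python
# Perplexities copied from the results table: (reading, writing).
RESULTS = {
    "Q8-hot/Q2-cold": {"profiled": (1.620, 1.456), "random": (1.667, 1.492)},
    "Q4-hot/Q2-cold": {"profiled": (1.635, 1.477), "random": (1.684, 1.511)},
}
Q8_READING = 1.568

# Count the comparisons where profiled tiering has lower (better) perplexity.
wins = sum(
    p < r
    for tier in RESULTS.values()
    for p, r in zip(tier["profiled"], tier["random"])
)
print(f"profiled beats random in {wins} of 4 comparisons")

# How much of the random-to-Q8 gap does profiling close at 18 GB (reading)?
rand = RESULTS["Q8-hot/Q2-cold"]["random"][0]
prof = RESULTS["Q8-hot/Q2-cold"]["profiled"][0]
recovered = (rand - prof) / (rand - Q8_READING)
print(f"gap to Q8 recovered by profiling: {recovered:.0%}")
```

On the reading eval that works out to 0.047 of a 0.099 gap, or about 47%.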

But here's the catch I have to be honest about: uniform Q4 is really good. On code, 4-bit is almost lossless - Q4 (20 GB) scores 1.582, basically tying Q8. So the fancy Q8-hot/Q2-cold model, despite all the cleverness, doesn't actually beat just using Q4 everywhere at a similar size.

The win shows up when you go smaller than Q4. I built a Q4-hot / Q2-cold version - 4-bit for the hot experts, 2-bit for the cold ones - which comes in at 14 GB, just 1 GB more than the blunt Q2 model. And it scores 1.635 reading and 1.477 writing - recovering ~90% of the quality gap between Q2 and Q4 for that single extra gigabyte. That's the real result: near-Q4 quality at near-Q2 size, by spending your bits on the experts that actually matter for the task.
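The ~90% figure is the fraction of the Q2-to-Q4 perplexity gap that the 14 GB model closes. Here's the same calculation as a small sketch, using only numbers from the results table (the helper name is mine):

```python
def gap_recovered(worst, best, candidate):
    """Fraction of the worst-to-best perplexity gap closed by the candidate."""
    return (worst - candidate) / (worst - best)


# uniform Q2 = worst, uniform Q4 = best, Q4-hot / Q2-cold (profiled) = candidate.
reading = gap_recovered(worst=2.103, best=1.582, candidate=1.635)
writing = gap_recovered(worst=1.977, best=1.449, candidate=1.477)

# Share of the Q2-to-Q4 size gap that the extra gigabyte represents.
size_fraction = (14 - 13) / (20 - 13)

print(f"reading: {reading:.0%}, writing: {writing:.0%}, size: {size_fraction:.0%}")
```

That comes out at roughly 90% on reading and 95% on writing, bought with about 14% of the size difference between uniform Q2 and uniform Q4.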

Conclusion

This is absolutely...
