Re-quantizing a local model, 14× faster - Andrea Borio
Re-quantizing a local model, 14× faster<br>Where a 2-bit model spends its bits, and why trying answers used to cost eighty minutes
Andrea Borio<br>Jun 10, 2026
I’m writing this with DeepSeek-V4-Flash running on my Mac, on a coding build I quantized myself. It took about eighty minutes to make. The reason for this post is that I can now rebuild a tweaked version of it in five.
The quant is about 81GB, the Mac has 64GB of RAM, and it runs anyway because antirez’s ds4 streams the experts off the SSD on demand instead of holding the whole model in memory. That streaming is its own kind of magic, and it’s also why the rest of this matters: when bits cost both quality and SSD reads, you really don’t want to waste them, and the only way to learn where they’re wasted is to try.<br>Quick primer for anyone who hasn’t been down this hole. Models store their knowledge as numbers, normally 16 bits each. To fit one on a laptop you round those down, to 4 bits, or if you’re desperate, 2. Two bits is repainting a Caravaggio with four crayons. You can still tell what it is. It just gets noticeably dumber.<br>So the game isn’t “use fewer bits.” It’s “waste fewer of them.”<br>The part that isn’t mine
DeepSeek is a Mixture-of-Experts model: instead of one big brain it’s a crowd of small specialists (256 of them per layer, across 43 layers), and for any given word only a few wake up. That’s exactly what makes SSD streaming work. You only pull the experts you actually use. There’s also a little router that picks them, plus some always-on shared parts.<br>So the move is: keep the decision-makers nearly perfect (8 bits) and be brutal with the experts (2 bits). Crushing a specialist that fires occasionally costs less than fuzzing the part that’s on for every word. antirez’s ds4 already does this, and does the actual quantization. My tool, forgequant, is a thin layer on top that turns a recipe into the right ds4 commands and records what it did. Credit where it’s due: the hard part is his.<br>What I added comes in two parts, and I’ll start with the one I’m most sure about.<br>The expensive part is trying
Here’s the thing nobody tells you about “where should the bits go”: the only honest way to answer it is to build a version and measure it. And building one is slow. A full quantize of this model is about eighty minutes. So you don’t experiment. You make one educated guess, wait an hour and a half, and live with whatever you got. The bottleneck was never ideas. It was the eighty minutes.<br>So I went after the eighty minutes first.<br>The key fact is that quantization is deterministic. Same weights, same target precision, same importance matrix, and you get the exact same bytes out, every time. Which means most of a “new” build isn’t new at all. When I take a 2-bit model and promote six layers to higher precision, only those six layers change. The other thirty-seven are byte-for-byte identical to the build I already have sitting on disk. Regenerating them is pure waste.<br>So I added --reuse PRIOR.gguf to my fork of ds4’s quantizer. You point it at a previous build, and instead of recomputing everything it copies the tensors that didn’t change and re-quantizes only the ones that did. In my test, that turned an eighty-minute rebuild into five and a half minutes: 1,310 of the 1,328 tensors copied straight across, 18 actually regenerated.<br>And it’s not “close enough.” I built the same variant both ways, the slow way and with --reuse, and compared all 1,328 tensors one by one. Every single one matched, byte for byte. The fast build isn’t an approximation of the slow one. It is the slow one, assembled differently.<br>It’s safe by construction. Each build stamps a fingerprint of its inputs (a hash of the model index, the imatrix, and every weight shard’s size and timestamp), and --reuse only copies a tensor when that fingerprint matches and the tensor’s type and shape line up exactly. Change the imatrix, swap the weights, ask for a different precision, and the fingerprint stops matching and it quietly rebuilds the affected parts from scratch. 
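To make the copy-or-rebuild rule concrete, here is a minimal Python sketch of the decision as described above. It is not the code in ds4 or in my fork. The tensor names, the `q2`/`q4` labels, the shapes, and the helpers `input_fingerprint` and `plan_reuse` are all hypothetical, and the toy model has three tensors per layer rather than the real 1,328.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorSpec:
    qtype: str    # precision label, e.g. "q2" or "q4" (hypothetical names)
    shape: tuple  # tensor dimensions


def input_fingerprint(index_bytes, imatrix_bytes, shard_stats):
    """Hash the model index, the imatrix, and each shard's size and mtime.

    shard_stats is an iterable of (filename, size_in_bytes, mtime) tuples.
    """
    h = hashlib.sha256()
    h.update(hashlib.sha256(index_bytes).digest())
    h.update(hashlib.sha256(imatrix_bytes).digest())
    for name, size, mtime in sorted(shard_stats):
        h.update(f"{name}:{size}:{mtime}\n".encode())
    return h.hexdigest()


def plan_reuse(wanted, wanted_fp, prior, prior_fp):
    """Return (copy, requantize): tensor names to copy vs. rebuild."""
    copy, requantize = [], []
    for name, spec in wanted.items():
        old = prior.get(name)
        # Copy only if the inputs are the same AND the tensor's type and
        # shape line up exactly; anything else is rebuilt from source.
        reusable = (
            prior_fp == wanted_fp
            and old is not None
            and old.qtype == spec.qtype
            and old.shape == spec.shape
        )
        (copy if reusable else requantize).append(name)
    return copy, requantize


# Toy model: 43 layers, three expert tensors per layer, all at 2 bits.
PARTS = ("gate", "up", "down")
base = {
    f"layer{i}.experts.{p}": TensorSpec("q2", (256, 64))
    for i in range(43)
    for p in PARTS
}

# Variant: promote the experts in six layers to a higher precision.
variant = dict(base)
for i in range(6):
    for p in PARTS:
        variant[f"layer{i}.experts.{p}"] = TensorSpec("q4", (256, 64))

fp = input_fingerprint(b"index", b"imatrix", [("shard-0", 1024, 1700000000)])
copy, requantize = plan_reuse(variant, fp, base, fp)
print(len(copy), "copied,", len(requantize), "requantized")
```

In this toy setup the six promoted layers account for 18 rebuilt tensors and the other 111 are copied. If you pass a fingerprint computed from a different imatrix, the plan falls back to rebuilding everything. That's the property that matters: a mismatch can only cost time, never correctness.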
There’s no path where it hands you a stale tensor and calls it done.<br>As far as I can tell this doesn’t exist anywhere else. llama.cpp always requantizes from the source weights. The closest thing in the wild is splicing tensors between files by hand, which is exactly what you’d want a tool to do for you.<br>And if you want to A/B a single change immediately, there’s an even blunter option, splice: copy the high-precision layers straight out of a donor file, no quantization at all. Seconds to minutes.<br>The quiet consequence is the whole point. When a rebuild costs five minutes instead of an hour and a half, you stop guessing and start searching. You build the 2-bit base once, then spin off ten variants in the time the old way made one. And...