Mixture-of-Experts (MoE), Explained: Why “Active Parameters” Decide What Runs on Your Machine
Here is a puzzle that trips up almost everyone new to local AI: a 671-billion-parameter model can run at usable speeds on the right desktop, while a "smaller" 70B model feels sluggish on the same hardware. How? The answer is an architecture called Mixture-of-Experts (MoE). Once you understand the single number it hinges on, model names like "Qwen 35B-A3B" or "DeepSeek-V3 671B-A37B" suddenly tell you exactly what your machine can and can't do.

This is the plain-English version: what MoE actually is, the one number that predicts performance, the memory trap that catches buyers, and the research behind it. No assuming you've read the papers.

The old rule MoE broke

In a traditional "dense" model, every parameter fires for every token. A 70B dense model does 70 billion parameters' worth of math to produce each token. That makes quality and speed a straight trade-off: bigger means smarter and slower. For years that was the iron law of local LLMs.

Mixture-of-Experts breaks it. Instead of one giant network, an MoE model splits much of itself into many smaller sub-networks called experts. For each token, a small router network picks just a few experts to actually run, and the rest sit idle. So the model can be enormous in total, but only a slice of it does work at any moment. You get the knowledge of a huge model at the compute cost of a small one.

The one number that matters: active vs total parameters

Every MoE model has two parameter counts, and confusing them is the single most common mistake:

Total parameters: every weight in the model, all experts included. This determines how much memory the model needs.

Active (or "activated") parameters: the weights that actually run per token. This determines speed and compute.

The naming convention you'll see encodes both. "A3B" means 3B active; the number before it is the total. Some real examples:
| Model | Total params | Active params | Runs like a… |
| --- | --- | --- | --- |
| Mixtral 8×7B | 47B | 13B (top-2 of 8 experts) | 13B for speed, 47B for smarts |
| "35B-A3B" class | ~35B | ~3B | 3B-fast, 35B-smart |
| DeepSeek-V3 | 671B | 37B | 37B for speed, 671B for smarts |
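
To make the two counts concrete, here is a minimal sketch that turns each row of the table into the share of weights active per token and a rough size for the weights alone. The class and method names are illustrative, and the default of 4.5 bits per weight is an assumption: a guess at the effective rate of a "4-bit" quantization once per-block scales are counted, not a figure from any particular format or from the models' authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_b: float   # total parameters, in billions (sets memory)
    active_b: float  # active parameters per token, in billions (sets compute)

    def active_fraction(self) -> float:
        """Share of the weights that do work for any one token."""
        return self.active_b / self.total_b

    def weight_gb(self, bits_per_weight: float = 4.5) -> float:
        """Rough size of the weights alone, in GB (10^9 bytes).

        bits_per_weight=4.5 is an assumed effective rate for a "4-bit"
        quantization; KV cache and runtime overhead are not included.
        """
        # billions of params * bits per param / 8 bits per byte = GB
        return self.total_b * bits_per_weight / 8


# Figures taken from the table above.
MODELS = [
    MoEModel("Mixtral 8x7B", total_b=47, active_b=13),
    MoEModel("35B-A3B class", total_b=35, active_b=3),
    MoEModel("DeepSeek-V3", total_b=671, active_b=37),
]

for m in MODELS:
    print(f"{m.name:14s} active share {m.active_fraction():5.1%}  "
          f"weights ~{m.weight_gb():.0f} GB")
```

Under these assumptions DeepSeek-V3 uses only about 5.5% of its weights per token, yet the weight estimate still scales with the full 671B. Change `bits_per_weight` to see how the memory side moves while the active share stays put.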
Mixtral is the model that made this mainstream: per its technical report, each token routes to 2 of 8 experts, so it touches just 13B of its 47B parameters, yet it matches or beats Llama 2 70B and GPT-3.5 on the benchmarks reported there. DeepSeek-V3 pushes the idea to the frontier: 671B total, 37B active. It thinks like a 671B model but computes like a 37B one.

Under the hood: what an “expert” actually is

It’s tempting to picture experts as little specialists: one for code, one for French, one for math. That’s a myth. An “expert” is simply one of several parallel feed-forward blocks (the dense number-crunching layer that sits after attention), each with its own weights. An MoE swaps the single feed-forward block in each layer for many of them, and the router learns which combination to use. The specialization that emerges is statistical and messy, not human-readable.

Two details matter for understanding the behavior:

Attention usually stays dense. Only the feed-forward layers are split into experts; the attention mechanism still runs in full for every token. That’s part of why MoE quality doesn’t collapse despite the sparsity: the part of the model that mixes context together is untouched.

The router is tiny but decisive. A small gating network scores the experts per token and picks the top few. Train it badly and you get “dead” experts that never fire or hot experts that overload. This is the load-balancing problem that Switch Transformers and later DeepSeek-V3 spent real effort solving.

Newer designs add a twist: shared experts that run for every token (capturing common knowledge) alongside the routed experts that specialize, plus “fine-grained” experts, meaning more, smaller experts for finer routing. That’s the DeepSeek recipe, and it’s why its active count (37B) buys more than the raw number suggests.

The memory trap: MoE saves compute, NOT memory

This is the part that catches buyers, so read it twice. Active parameters set your speed. Total parameters set your memory.
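
A toy router makes both halves of that rule concrete. The sketch below is not any model's real gating code: random scores stand in for the learned gating network, and the names `route`, `NUM_EXPERTS`, and `TOP_K` are illustrative. It picks the top 2 of 8 experts per token and weights them with a softmax over the selected scores (one common choice), then counts how many distinct experts were needed across a run of tokens.

```python
import math
import random

NUM_EXPERTS = 8   # experts per layer in this toy setup
TOP_K = 2         # experts actually run for each token


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route(scores, k=TOP_K):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    return list(zip(top, weights))


rng = random.Random(0)  # fixed seed so the run is repeatable
used = set()
for _ in range(200):
    # Assumption: random scores replace a learned linear layer applied
    # to the token's hidden state.
    scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    used.update(index for index, _ in route(scores))

print(f"experts run per token: {TOP_K} of {NUM_EXPERTS}")
print(f"distinct experts used across 200 tokens: {len(used)}")
```

Each token pays for only two experts' worth of compute, but over even a short run every expert ends up being called, so none of them can be left out of memory.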
The router only runs a few experts per token, but it could pick any of them for the next token, so all the experts must be sitting in fast memory, ready. You don't get to store only the active slice.

Concretely:

A ~35B-total MoE at 4-bit needs roughly 18–20 GB just to hold the weights, the same as a 35B dense model, even though only ~3B are active. The memory bill is set by the total.

DeepSeek-V3's 671B, even quantized to 4-bit, wants ~380 GB, which is server-and-cluster territory, despite "only" 37B active. Fast to run if you can hold it; almost nobody can.

This is exactly why the large-unified-memory box became the local-LLM darling. A 128 GB Strix Halo, Framework Desktop, or Mac Studio isn't about raw compute. It's about having enough fast memory to hold every expert of a big MoE, so the model's tiny active footprint can then rip through tokens. MoE is the software trend that makes...