Mixture-of-Experts (MoE), Explained: Why “Active Parameters” Decide What Runs on Your Machine

Here is a puzzle that trips up almost everyone new to local AI: a 671-billion-parameter model can run at usable speeds on the right desktop, while a "smaller" 70B model feels sluggish on the same hardware. How?

The answer is an architecture called Mixture-of-Experts (MoE), and once you understand the single number it hinges on, model names like "Qwen 35B-A3B" or "DeepSeek-V3 671B-A37B" suddenly tell you exactly what your machine can and can't do.

This is the plain-English version: what MoE actually is, the one number that predicts performance, the memory trap that catches buyers, and the research behind it. No assuming you've read the papers.

The old rule MoE broke

In a traditional "dense" model, every parameter fires for every token. A 70B dense model does 70 billion parameters' worth of math to produce each word. That makes quality and speed a straight trade-off: bigger means smarter and slower. For years that was the iron law of local LLMs.

Mixture-of-Experts breaks it. Instead of one giant network, an MoE model splits much of itself into many smaller sub-networks called experts. For each token, a small router network picks just a few experts to actually run, and the rest sit idle. So the model can be enormous in total, but only a slice of it does work at any moment. You get the knowledge of a huge model at the compute cost of a small one.

The one number that matters: active vs total parameters

Every MoE model has two parameter counts, and confusing them is the single most common mistake:

Total parameters: every expert added up. This determines how much memory the model needs.

Active (or "activated") parameters: what actually runs per token. This determines speed and compute.

The naming convention you'll see encodes both. "A3B" means 3B active; the number before it is the total. Some real examples:

| Model | Total params | Active params | Runs like a… |
| --- | --- | --- | --- |
| Mixtral 8×7B | 47B | 13B (top-2 of 8 experts) | 13B for speed, 47B for smarts |
| "35B-A3B" class | ~35B | ~3B | 3B-fast, 35B-smart |
| DeepSeek-V3 | 671B | 37B | 37B for speed, 671B for smarts |
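To make the naming convention concrete, here is a minimal sketch that reads the total and active counts out of a model name and reports what share of the model runs per token. The helper names (`parse_moe_name`, `active_fraction`) and the regular expression are illustrative assumptions, not part of any library, and the pattern only covers names shaped like "35B-A3B".

```python
import re

# Hypothetical pattern for names shaped like "<total>B-A<active>B".
MOE_TAG = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B")


def parse_moe_name(name):
    """Return (total_billions, active_billions), or None if the tag is absent."""
    match = MOE_TAG.search(name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def active_fraction(total_b, active_b):
    """Share of the parameters that actually run for each token."""
    return active_b / total_b


for name in ["Qwen 35B-A3B", "DeepSeek-V3 671B-A37B", "Mixtral 8x7B"]:
    parsed = parse_moe_name(name)
    if parsed is None:
        # Names like "8x7B" do not encode the active count at all.
        print(f"{name}: no total/active tag in the name")
    else:
        total_b, active_b = parsed
        share = active_fraction(total_b, active_b)
        print(f"{name}: {total_b:g}B total, {active_b:g}B active ({share:.1%} per token)")
```

For both tagged names the active share comes out under 10%, which is the whole trick: the speed figure and the size figure sit an order of magnitude apart.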

Mixtral is the model that made this mainstream: per its technical report, each token routes to 2 of 8 experts, so it touches just 13B of its 47B parameters, yet it matches or beats Llama 2 70B and GPT-3.5 on the benchmarks reported there. DeepSeek-V3 pushes the idea to the frontier: 671B total, 37B active. It thinks like a 671B model but computes like a 37B one.

Under the hood: what an “expert” actually is

It’s tempting to picture experts as little specialists: one for code, one for French, one for math. That’s a myth. An “expert” is simply one more instance of the model’s feed-forward block (the dense number-crunching layer that sits after attention), with its own weights. An MoE swaps the single feed-forward block in each layer for many of them, and the router learns which combination to use. The specialization that emerges is statistical and messy, not human-readable.

Two details matter for understanding the behavior:

Attention usually stays dense. Only the feed-forward layers are split into experts; the attention mechanism still runs in full for every token. That’s part of why MoE quality doesn’t collapse despite the sparsity: the part of the model that mixes context together is untouched.

The router is tiny but decisive. A small gating network scores the experts per token and picks the top few. Train it badly and you get “dead” experts that never fire or hot experts that overload. This is the load-balancing problem that Switch Transformers and later DeepSeek-V3 spent real effort solving.

Newer designs add a twist: shared experts that run for every token (capturing common knowledge) alongside the routed experts that specialize, plus “fine-grained” experts, meaning more, smaller experts for finer routing. That’s the DeepSeek recipe, and it’s why its active count (37B) buys more than the raw number suggests.

The memory trap: MoE saves compute, NOT memory

This is the part that catches buyers, so read it twice. Active parameters set your speed. Total parameters set your memory.
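The routing step described under the hood can be sketched as a toy top-k gate. This is an illustration under loud assumptions, not any model's actual code: the experts are stand-in scalar functions, the router scores are random numbers where a real model would compute them from the token with a learned layer, and softmax over the selected scores is just one common weighting choice. The names `top_k_route` and `moe_layer` are hypothetical.

```python
import math
import random


def top_k_route(scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_scores, k))


# Toy setup (assumption): 8 experts, each a trivial scalar function.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=s: scale * x for s in range(NUM_EXPERTS)]

random.seed(0)
used = set()
for _ in range(1000):  # 1000 pretend tokens
    # Random stand-in for the scores a learned router would produce.
    scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    used.update(i for i, _ in top_k_route(scores, TOP_K))
    moe_layer(1.0, experts, scores, TOP_K)  # only TOP_K experts do any work

print(f"experts run per token: {TOP_K}, distinct experts used overall: {len(used)}")
```

Each token touches only two experts, yet across the run every expert gets picked at some point. That is why per-token compute stays small while none of the experts can be left out of memory.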
The router only runs a few experts per token, but it could pick any of them next token, so all the experts must be sitting in fast memory, ready. You don't get to store only the active slice. Concretely:

A ~35B-total MoE at 4-bit needs roughly 18–20 GB just to hold the weights, the same as a 35B dense model, even though only ~3B are active. The memory bill is set by the total.

DeepSeek-V3's 671B, even quantized to 4-bit, wants ~380 GB, which is server-and-cluster territory, despite "only" 37B active. Fast to run if you can hold it; almost nobody can.

This is exactly why the large-unified-memory box became the local-LLM darling. A 128 GB Strix Halo, Framework Desktop, or Mac Studio isn't about raw compute. It's about having enough fast memory to hold every expert of a big MoE, so the model's tiny active footprint can then rip through tokens. MoE is the software trend that makes...
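The weight-memory figures quoted above follow from simple arithmetic on the total parameter count, sketched here. Two things are assumptions of this sketch rather than claims from the article: the effective 4.5 bits per weight (4-bit formats typically store scaling data alongside the weights, so the real figure sits a little above 4) and the `headroom_gb` allowance for everything that is not weights. The function names are hypothetical.

```python
def weight_memory_gb(total_params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8


def fits(total_params_billion, bits_per_weight, memory_gb, headroom_gb=8.0):
    """True if the weights plus an assumed headroom fit in memory_gb.

    headroom_gb is a placeholder for the KV cache, the OS, and other overhead.
    """
    needed = weight_memory_gb(total_params_billion, bits_per_weight) + headroom_gb
    return needed <= memory_gb


# Note that only the TOTAL count appears here; the active count never does.
for label, total_b in [("35B-A3B class", 35), ("DeepSeek-V3", 671)]:
    low = weight_memory_gb(total_b, 4.0)   # idealized 4 bits per weight
    high = weight_memory_gb(total_b, 4.5)  # assumed effective bits per weight
    verdict = "fits" if fits(total_b, 4.5, memory_gb=128) else "does not fit"
    print(f"{label}: about {low:.1f} to {high:.1f} GB of weights, {verdict} in 128 GB")
```

Under these assumptions the 35B-class model lands inside the 18–20 GB range quoted above and DeepSeek-V3 lands near the ~380 GB figure, far beyond a 128 GB machine.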
