Local Models in Mid-2026


Local models in mid-2026: the engineering that closed the gap · coles.codes

The 2026 local-model story is quieter than the headlines suggest. Open weights did not catch up to the frontier, but they got close enough on the work most of us do day to day. Running LLMs locally is no longer just a hobby project; it has become a reasonable choice if you're after a basic model for writing and research, or one to run as a specialised agent.

What I find interesting is the engineering that got us here. Progress didn't just mean buying more RAM to run bigger models. If anything it was the reverse: people figured out how to spend less compute and less memory per token without losing quality.

My current favourite models

Qwen 3.6 shipped an open dense 27B alongside a 35B mixture-of-experts (MoE) that only fires about 3B parameters per token. Gemma 4 from Google spans a spread of sizes, and the larger ones punch well above their weight. GLM-5 is a 744B MoE and Kimi K2.6 is a trillion parameters with 32B active (although both GLM-5 and Kimi K2.6 need a bit too much RAM to run on my local setup 😅). DeepSeek previewed V4 in April in two flavours, Flash and Pro, both MoE with a million-token context.

Interestingly, almost no model in the list above is a dense model you load whole. The overall parameter counts are large and the active counts are small, and that gap is what the rest of this post comes down to.

Sparse attention

Standard attention is quadratic: the work grows with the square of the context length. Every new token has to look back at every token before it, so a context twice as long means twice as many tokens each doing twice as much looking, which is four times the work. Ten times the context is a hundred times the work.
At a million tokens that really adds up, and it's probably why context windows stayed small for so long (that, and models losing track of what you originally asked once the context got long).

              attends to ->
              t1 t2 t3 t4 t5 t6 t7 t8
full    t4    ■  ■  ■  ■                 every token reads every
        t8    ■  ■  ■  ■  ■  ■  ■  ■     token before it: n tokens
                                         doing n look-backs = n^2

sparse  t8    ■  ·  ·  ■  ·  ·  ■  ■     a small recent window plus
                                         the indexer's top-k picks
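
The diagram above can be sketched in a few lines of Python. This is a rough illustration, not DeepSeek's implementation: the function names, the `scores` list standing in for the indexer's output, and the window and top-k sizes are all made up for the example. It only counts the reads done by the main attention and ignores whatever the scoring pass itself costs.

```python
def full_attention_reads(n: int) -> int:
    """Total look-backs when token i reads tokens 1..i (itself included)."""
    return n * (n + 1) // 2


def sparse_positions(scores: list[float], window: int, top_k: int) -> list[int]:
    """Positions one query reads: the last `window` tokens, plus the `top_k`
    best-scoring older tokens. `scores` is a hypothetical stand-in for the
    value an indexer assigns to each token up to and including the query."""
    n = len(scores)
    recent_start = max(0, n - window)
    recent = list(range(recent_start, n))
    # Rank everything older than the window by score, best first.
    older = sorted(range(recent_start), key=lambda j: scores[j], reverse=True)
    return sorted(older[:top_k] + recent)


def sparse_attention_reads(n: int, window: int, top_k: int) -> int:
    """Total reads when each token reads at most window + top_k positions."""
    return sum(min(i, window + top_k) for i in range(1, n + 1))


# The t8 row of the diagram: 0-based positions 0..7 are t1..t8.
# Made-up scores that happen to favour t1 and t4 among the older tokens.
scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.1, 0.5, 0.4]
picked = sparse_positions(scores, window=2, top_k=2)
print([f"t{j + 1}" for j in picked])  # -> ['t1', 't4', 't7', 't8']

# Doubling the context roughly quadruples the full-attention reads,
# but only roughly doubles the sparse reads.
print(full_attention_reads(2000) / full_attention_reads(1000))
print(sparse_attention_reads(2000, 2, 2) / sparse_attention_reads(1000, 2, 2))
```

The point of the toy numbers is the shape of the growth: once the window and top-k sizes are fixed, each new token costs about the same no matter how long the context already is.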

DeepSeek's work is the clearest example of sparse attention. DeepSeek V3.2 introduced what they call DeepSeek Sparse Attention, and V4 builds on it. The mechanism is a "lightning indexer", a cheap scoring function running in FP8 that decides, for each query token, which earlier tokens are actually worth attending to. You keep a small sliding window of recent tokens at full resolution for local coherence, and for everything older you attend only to the top-k tokens the indexer flagged. The attention cost per query then scales with the size of the selected set instead of the whole context, so the total drops from quadratic to roughly linear in context length. The indexer runs on a separate CUDA stream so its latency hides behind work that's already happening instead of landing on the critical path.

DeepSeek reported that V4-Pro needs something like a quarter of the per-token inference FLOPs and a tenth of the KV cache that V3.2 needed at million-token context. That is the gap between long context working as a demo and working as something you can actually build on.

Mixture-of-experts

MoE is the reason a trillion-parameter model runs at all. Instead of one big dense feed-forward network, you have many smaller "expert" networks and a router that sends each token to a handful of them. Kimi K2.6 has 384 experts and activates eight plus a shared one per token. GLM-5 activates roughly 40B of its 744B. The model has the knowledge capacity of its full parameter count but the per-token cost of something much smaller.

The catch is that you still have to hold every expert in memory even though you only touch a few per token. So MoE is cheap on compute and bandwidth but very heavy on capacity. That tension is exactly why the hardware section below matters, and why unified-memory machines turned out to suit these models almost by accident.

The KV cache problem

People underestimate this one.
For long context, and for reasoning models that emit twenty thousand tokens of working-out, the dominant memory cost at inference isn't the weights: it's the KV cache, the stored keys and values for every token you've seen so far. It grows linearly with context and it has to stay in fast memory.

Two lines of attack showed up everywhere this year. The first is Multi-head Latent Attention, DeepSeek's trick of compressing the KV cache into a low-rank latent rather than storing it in full, which cuts the footprint by something like ninety percent. Kimi and others adopted variants. The second is simpler: store the cache at lower precision, FP8 and increasingly FP4, which halves or quarters the memory relative to a 16-bit cache for a small accuracy cost you can mostly train back. Combine compressed attention with a compressed, quantised cache and the long-context memory wall moves a long way out.

Multi-token prediction...

