Towards local plug-and-play AI



@adlrocha Beyond The Code


@adlrocha - Towards local plug-and-play AI<br>Local LLM inference optimisations: from attention mechanisms to predictive decoding and software-model-hardware implementations.<br>adlrocha<br>May 17, 2026


Last week I wrote about the hardware side of running AI locally: why memory bandwidth matters more than raw compute, which machines are worth building, and where the market is heading. If you missed it, start there, as this post builds directly on top of it.<br>In the quest to become AI independent, your hardware sets the ceiling, but software decides how close you actually get to it.<br>Take two machines with identical GPUs, identical VRAM, and identical bandwidth, one running naive inference and the other an optimised stack. They can show a 3-5x difference in tokens per second. This can mean the difference between running at 5 tok/s and the 20-30 tok/s that you need to get something usable. Beyond that, some techniques and software-model-hardware optimised implementations may allow you to fit large models like DeepSeek-V4-Flash on a MacBook with at least 96GB of RAM. Same model, same hardware, different choices in the software layer (I feel we have a lot to learn from the video encoding and compression industry in this respect. We have to squeeze the most out of every $ of hardware resources).<br>Last week I presented my new goal: to create an inference box that generates tokens fast enough within a certain price point. So far I haven’t found one that fully fits my needs. What I am hoping for as the outcome of this work is a hardware configuration optimised for local inference that is plug-and-play and not absurdly expensive, and/or a tool that detects your current hardware and suggests the best model and configuration for it.<br>This post continues that search, focusing on the latest techniques to improve my inference stack.

MoE vs dense models

Before we get to the software tricks, there’s an architectural decision that sits underneath all of them, because it changes what the software layer has to deal with.<br>Most of the interesting models you’re likely to run locally with decent throughput are Mixture-of-Experts architectures: Qwen3.6-35B-A3B, Qwen3-235B-A22B-250, DeepSeek-V4. The naming convention tells you the structure: Qwen3.6-35B-A3B has 35B total parameters, but only 3B active per token. The model is divided into expert sub-networks, and a router decides which ones fire for each token.<br>The key advantage of these models (and the reason they have the best chance of fitting your hardware) is that if only 3B of 35B parameters do any work per token, you get something close to 3B-scale inference speed while the model carries 35B-scale knowledge. With the right serving trick, like llama.cpp’s -ngl 99 -ncmoe 99 flags, which keep the attention and shared weights on the GPU and offload cold expert FFN layers to system RAM, a Qwen3.6-35B-A3B can hit 33.5 tok/s on an RTX 3070 Ti with just 8GB of VRAM, provided you have 64GB+ of fast system RAM for the offloaded experts. The floor for running a 35B-knowledge model just dropped further than most people realise.<br>The main downside is consistency. If you have used one of these MoE models long enough, you have probably experienced what I am about to describe.<br>Think of a MoE model as a hospital. Each patient gets routed to the right specialist. But which specialist fires depends on the token. The model can feel sharp in one domain and noticeably weaker in another depending on which expert activates. Dense models don’t have this problem because every parameter processes every token, every time. Slower, more expensive, but completely consistent.
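To make the "only the active parameters count" argument concrete, here is a back-of-the-envelope sketch. It assumes decoding is purely memory-bandwidth-bound, so every generated token has to stream each active weight once, and it ignores KV-cache reads, compute, and routing overhead. The function name, the 4.5 bits per weight (roughly a Q4-style quant), and the 80 GB/s bandwidth figure are illustrative assumptions, not measurements.

```python
def decode_ceiling_tok_s(active_params_b: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed when memory bandwidth is the bottleneck.

    active_params_b: parameters touched per token, in billions
    bits_per_weight: average bits per weight after quantisation (assumed)
    bandwidth_gb_s:  sustained bandwidth of the memory holding the weights (assumed)
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical numbers: same memory, same quantisation, different architecture.
moe = decode_ceiling_tok_s(active_params_b=3, bits_per_weight=4.5, bandwidth_gb_s=80)
dense = decode_ceiling_tok_s(active_params_b=27, bits_per_weight=4.5, bandwidth_gb_s=80)

print(f"MoE, 3B active: {moe:.1f} tok/s ceiling")
print(f"Dense, 27B:     {dense:.1f} tok/s ceiling")
```

The absolute numbers are optimistic ceilings, so treat them as a way to compare configurations rather than a prediction. The ratio is the useful part: under these assumptions the ceiling scales inversely with the number of active parameters, which is why a 3B-active MoE can stay usable on memory that would leave a 27B dense model crawling.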
The inconsistency of MoE models shows up in practice as tool-call loops, performance degradation, and catastrophic forgetting.<br>For reasoning tasks, for long-context coherence, for anything where you need the model to stay sharp across a 50,000-token context, dense models tend to be better. This is why I would always recommend dense models for any agentic task that requires several assistant turns and accurate context keeping.<br>There’s a serving problem too: when too many tokens in a batch route to the same expert simultaneously, that expert’s buffer overflows and tokens get dropped silently. Nothing warns you that this is happening; output quality just gets worse. Dense inference has none of that complexity.<br>So how do you choose between dense and MoE models? Here’s the practical decision tree that I currently use myself:<br>8GB VRAM GPU + 64GB system RAM? MoE with expert offload is your only real option for a capable model. Qwen3.6-35B-A3B at Q4 or Gemma4-26B-A4B with llama.cpp offload fits this profile. Throughput will be bound by system memory bandwidth, not by the GPU, so a fast DDR5 system matters more than GPU generation here.<br>16–24GB VRAM (RTX 3090, RTX 4090, RTX 4080)? You have a genuine choice. Dense Qwen3.6-27B at Q4_K_M fits in ~16GB with no offloading, no serving complexity,...
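The expert-overflow problem described above can be sketched in a few lines. This is a minimal illustration of capacity-based top-1 routing, assuming each expert gets a fixed-size buffer derived from a capacity factor. The function name, the capacity formula, and the example batch are assumptions for illustration; whether and how a given serving stack drops tokens depends on its implementation, and this is not a description of llama.cpp or any other specific engine.

```python
from collections import defaultdict


def route_with_capacity(expert_choices, num_experts, capacity_factor=1.25):
    """Assign tokens to their chosen expert until that expert's buffer is full.

    expert_choices[i] is the expert index the router picked for token i (top-1).
    Returns (assignments, dropped), where assignments maps token index to expert
    and dropped lists the token indices that overflowed.
    """
    # Each expert gets an equal share of the batch, scaled by the capacity factor.
    capacity = int(capacity_factor * len(expert_choices) / num_experts)
    load = defaultdict(int)
    assignments, dropped = {}, []
    for token_idx, expert in enumerate(expert_choices):
        if load[expert] < capacity:
            load[expert] += 1
            assignments[token_idx] = expert
        else:
            # Overflow: the token skips the expert, and no error is raised.
            dropped.append(token_idx)
    return assignments, dropped


# Hypothetical batch: 8 tokens, 4 experts, five tokens all want expert 0.
choices = [0, 0, 0, 0, 0, 1, 2, 3]
assignments, dropped = route_with_capacity(choices, num_experts=4, capacity_factor=1.0)
print("dropped token indices:", dropped)  # [2, 3, 4]
```

With a capacity of two tokens per expert, three of the five tokens that picked expert 0 never reach it, and nothing in the return path forces the caller to notice. Raising the capacity factor trades memory for fewer drops, which is the kind of serving knob dense models simply don't have.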
