Making Local LLM Fast

Bogdan's Ramblings - Technology blog

10 Jun 2026

on Programming


I built a tool called Fono. It’s a voice front-end for your computer that can run entirely on your own machine. No cloud, no account, no audio leaving the laptop. It has three jobs, on two hotkeys:

Hold F7 to dictate and cleaned-up text lands in whatever window you were typing in.

Hold F8 to talk to an assistant and it answers various things. It can also call tools (look at what’s on your screen, and more to come) to actually do things, not just reply.

Give your computer a voice. Coding agents or other tools can speak to you through it.

The first two jobs can lean on a local language model. The problem is that a local model has to think before it speaks, and that feels slow in a way a cloud API running on a rack of GPUs does not. On my laptop, the very first assistant turn was taking almost three seconds before a single word came back. By the sixth turn of a conversation it was closer to seven. That’s death by a thousand milliseconds. It makes the whole thing feel broken even when it’s working perfectly.

This post is the story of getting that first word out in about a third of a second instead. I will explain what actually costs the time, show you the bug I made (it’s a good one) and give you the exact commands to reproduce everything on your own hardware. I don’t do “trust me bro” benchmarks :)

What actually takes the time

When you send a prompt to a local model, the work splits into two very different phases.

Prefill is the model reading your prompt. Every token (every chunk of text) gets pushed through the neural network so the model “understands” the context before it responds.

Decode is the model answering. You watch this happen, word by word.

The number a user feels is the time to first token. That’s the gap between letting go of the hotkey and the first word appearing. (In Fono it’s a bit more complex, but we’ll go with this.) That gap is almost entirely prefill. Decode controls how fast the answer then streams out, but prefill is what makes you sit there wondering if anything is happening.
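To make "time to first token" concrete, here is a minimal sketch of how you could measure it around any streaming token source. The names `measure_stream` and `fake_model` are made up for illustration, and the fake model just sleeps to imitate prefill and decode. It is not how Fono measures anything.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, str]:
    """Consume a token stream and return (time_to_first_token, total_time, text)."""
    start = time.perf_counter()
    first = None
    pieces: List[str] = []
    for tok in tokens:
        if first is None:
            # The wait up to this point is what the user feels as "nothing happening".
            first = time.perf_counter() - start
        pieces.append(tok)
    total = time.perf_counter() - start
    if first is None:  # empty stream: no token ever arrived
        first = total
    return first, total, "".join(pieces)


def fake_model(prefill_s: float, per_token_s: float, words: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a local model: one long pause, then a steady stream."""
    time.sleep(prefill_s)        # imitates prefill (reading the prompt)
    for w in words:
        yield w
        time.sleep(per_token_s)  # imitates decode (one token at a time)


ttft, total, text = measure_stream(fake_model(0.05, 0.01, ["hello", " ", "world"]))
print(f"first token after {ttft:.3f}s, whole answer after {total:.3f}s: {text!r}")
```

Swap `fake_model` for whatever iterator your runtime hands you and the two numbers separate cleanly: the first is mostly prefill, the gap between them is decode.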

If you are not careful, a local chat assistant makes you pay prefill over and over. Every turn, the model re-reads the entire conversation so far: the system instructions, every previous question, every previous answer. All that work just to append one new sentence. Turn six pays for turns one through five all over again.
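A quick back-of-the-envelope sketch shows how fast this adds up. The token counts below are invented for illustration (a 400-token system prompt, 30-token questions, 120-token answers), and the function name is hypothetical. The "reuse" branch assumes the state for everything already read or generated is still in memory, which is the idea the rest of the post builds toward.

```python
from typing import List, Tuple


def prefill_cost_per_turn(
    system_tokens: int,
    turns: List[Tuple[int, int]],  # (user_tokens, answer_tokens) for each turn
    reuse_prefix: bool,
) -> List[int]:
    """Return how many prompt tokens the model has to read on each turn."""
    costs = []
    history = system_tokens  # tokens in the conversation before this turn
    cached = 0               # tokens whose state we assume is already in memory
    for user_tokens, answer_tokens in turns:
        prompt = history + user_tokens
        costs.append(prompt - cached if reuse_prefix else prompt)
        history = prompt + answer_tokens
        if reuse_prefix:
            cached = history  # assumption: generated answers stay cached too
    return costs


six_turns = [(30, 120)] * 6  # made-up sizes, not measurements
print(prefill_cost_per_turn(400, six_turns, reuse_prefix=False))
# -> [430, 580, 730, 880, 1030, 1180]
print(prefill_cost_per_turn(400, six_turns, reuse_prefix=True))
# -> [430, 30, 30, 30, 30, 30]
```

Without reuse, the tokens read per turn grow with the length of the conversation. With reuse, every turn after the first costs only the new question.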

Detour: how prefill and decode stress hardware

Let’s spend 2 minutes on this because I find it very interesting. Prefill and decode bottleneck on different parts of your computer.

Prefill is compute-bound. All the prompt’s tokens can go through the network together, in one big parallel batch of matrix multiplications. That keeps every core and every SIMD lane (the AVX/NEON instructions that do many multiply-adds per clock) busy. More cores and wider vector instructions make prefill faster. This is why the right build of llama.cpp for your CPU matters a little bit.

Decode is memory-bandwidth-bound. To produce one token, the model has to read essentially all of its weights out of RAM. A 4-billion-parameter model at 4 bits is roughly 2-3 GB. Generating 100 tokens means streaming those gigabytes from memory a hundred times over. Your cores spend most of that time waiting for data, not computing. This is the dirty secret of local LLM speed. Decode is usually limited by how fast you can move the model out of RAM, not by raw math. It’s also why quantization (shrinking the weights) speeds decoding up and why fast memory matters.
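The arithmetic behind that claim fits in a few lines. This is a rough ceiling, not a benchmark: it assumes every weight is read exactly once per token, ignores the KV cache and quantization metadata (part of why real 4-bit files land closer to the 2-3 GB mentioned above), and the 50 GB/s memory bandwidth is a placeholder figure, not a measurement of any particular laptop.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes (no metadata, no KV cache)."""
    return n_params * bits_per_weight / 8


def decode_ceiling_tokens_per_s(
    n_params: float, bits_per_weight: float, mem_bandwidth_bytes_per_s: float
) -> float:
    """Upper bound on decode speed if each token requires one full pass over the weights."""
    return mem_bandwidth_bytes_per_s / weight_bytes(n_params, bits_per_weight)


ASSUMED_BANDWIDTH = 50e9  # bytes per second; placeholder, plug in your own number

for bits in (16, 8, 4):
    size_gb = weight_bytes(4e9, bits) / 1e9
    ceiling = decode_ceiling_tokens_per_s(4e9, bits, ASSUMED_BANDWIDTH)
    print(f"{bits}-bit weights: {size_gb} GB to stream per token, at most {ceiling} tokens/s")
```

Halving the bits per weight halves the bytes streamed per token, so the ceiling doubles. Adding cores does not move it at all, which is the whole point of calling decode memory-bandwidth-bound.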

Of course, there are many caveats here, but let’s keep this lean :)

Two takeaways set up the rest of the post. First, since the latency you feel is dominated by prefill, and prefill cost is “how much prompt do I have to read”, the way to win is to read less prompt. Second, the part of the prompt we re-read every turn is exactly the part that doesn’t change. We shouldn’t be reading it at all.

The trick: stop reading what you already read

The model’s “understanding” of the prompt isn’t thrown away after prefill. It lives in a chunk of memory called the KV cache (key/value cache, the name doesn’t matter here). Think of it as the model’s working memory of everything it has read so far.

The insight that makes local models usable: if the start of this turn’s prompt is identical to last turn’s, the model’s working memory for that part is identical too. So don’t recompute it, reuse it. Snapshot that state and next turn restore the snapshot instead of prefilling from scratch. Restoring is basically a memory copy. In Fono it takes 15-40 milliseconds regardless of size. Prefilling the same content cold can take seconds (the right lane in the animation above). That’s the whole game.
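Here is a toy sketch of the bookkeeping behind that idea. It only tracks which tokens the "model" has already read and counts how many new ones a turn would need to prefill. Nothing here is Fono's actual code or a llama.cpp API: `ToyPrefixCache` and `common_prefix_len` are hypothetical names, and token IDs are plain integers.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers the last prompt and reports how much of the next one is new."""

    def __init__(self) -> None:
        self.cached_tokens: List[int] = []

    def prefill(self, prompt: Sequence[int]) -> int:
        """Return how many tokens would have to be read for this prompt."""
        reusable = common_prefix_len(self.cached_tokens, prompt)
        to_read = len(prompt) - reusable
        # A real runtime would restore the saved state for `reusable` tokens here
        # and run the network only over the remaining `to_read` tokens.
        self.cached_tokens = list(prompt)
        return to_read


cache = ToyPrefixCache()
turn_1 = [1, 2, 3, 4, 5]       # system prompt + first question
turn_2 = turn_1 + [6, 7]       # same start, one new question appended
print(cache.prefill(turn_1))   # -> 5 (cold: everything is read)
print(cache.prefill(turn_2))   # -> 2 (only the new tokens are read)
```

The design choice worth noticing is that the cache never asks "have I seen this text before?". It only asks how long the shared start is, which is cheap to check and easy to reason about.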

There is one ironclad rule though, and it’s where I shot myself in the foot.

Reuse only works from the front

Cache reuse only works for a prefix that is byte-for-byte identical from the very first token. The...
