Local LLM Inference Optimization: The Complete Guide
About the author
Kartikey Chauhan
Data analyst from Toronto. I write about analytics, local AI, homelabs, side projects, and whatever else catches my eye.
Note: This post was drafted with significant AI assistance, synthesizing notes, bench results, and scripts from the l3ms homelab toolkit and the series of model-running posts on this site. The experiments, numbers, and failure modes documented here are real - the synthesis and prose are AI-assisted.
Preface #
Over the past year I've written posts on running gpt-oss-120b, Qwen3-Coder-Next, Gemma 4 26B, Qwen3.6-35B-A3B, and Gemma 4 MTP locally on consumer hardware. Each post has its own notes, failure modes, and tuning results - but the same lessons keep appearing: enable XMP, pin to P-cores, quantize your KV cache, don't trust the power profile.
This is my attempt at a master reference. Instead of re-discovering flags in every new model post, I want one doc to link back to. If you're hitting a performance wall, starting from scratch, or just want to understand what each knob actually does - start here.
The scope is intentionally wide. We start from "should I even run locally?" and drill all the way down to CUDA environment variables and specific failure modes. Skip to wherever you're stuck.
1. TL;DR: Start Here #
If you want maximum control and performance: use llama.cpp directly. This guide assumes that path.
If you want desktop UX, model browsing, and a good local OpenAI-compatible endpoint: LM Studio is perfectly reasonable.
If you want multi-user serving, batching, and production throughput: evaluate vLLM.
If you are on Apple Silicon: compare llama.cpp Metal with mlx; unified memory changes the sizing math.
If token generation (TG) is slow on MoE models: check RAM speed before touching flags. With XMP/EXPO off, TG can be 2-3x slower.
If you hit VRAM limits: reduce context, quantize KV cache, lower --parallel, then tune layer placement.
If you use MTP speculative decoding: benchmark draft acceptance and KV cache precision together; raw TPS is not enough.
If you are running a single-user homelab: prefer --parallel 1, explicit context sizing, and static placement once you have a stable config.
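To see why "reduce context, quantize KV cache" comes first when VRAM runs out, it helps to do the arithmetic. The sketch below assumes a plain full-attention transformer, where K and V are stored for every layer, KV head, and context position. The model dimensions are hypothetical placeholders, not taken from any model in this guide, and architectures with sliding-window or hybrid attention will come out smaller than this estimate.

```python
# Bytes per cached element for two llama.cpp cache types.
# f16 is 2 bytes per value; q8_0 stores blocks of 32 int8 values
# plus one f16 scale, i.e. 34 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Rough KV cache size for a full-attention model (an upper-bound style estimate)."""
    # Factor of 2 covers the separate K and V tensors.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return int(elements * BYTES_PER_ELEMENT[cache_type])


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to make the numbers easy to follow.
    dims = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=64 * 1024)
    for cache_type in ("f16", "q8_0"):
        gib = kv_cache_bytes(cache_type=cache_type, **dims) / 2**30
        print(f"{cache_type}: {gib:.3f} GiB")
```

With these placeholder dimensions the cache works out to 12 GiB at f16 and a little under 6.4 GiB at q8_0, and halving the context halves either figure. Swap in the real layer count, KV head count, and head dimension from your model's metadata before trusting the result.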
1.1 Where to Jump In #
This is a reference, not a linear tutorial. Start with the part that matches the problem:
MoE generation is slow: check RAM speed, then layer placement and P-core pinning.
The model does not fit or dies later in a session: start with --fit, context and KV cache, then the known OOM causes.
Vision fails at load or on the first image: go to Vision / Multimodal. The projector and image batch need their own headroom.
MTP is no faster than normal decoding: check draft acceptance and KV precision, not just reported TG.
You use LM Studio or Ollama: the hardware, OS, and security sections still apply. Most llama.cpp flags do not.
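For the MTP case, a minimal way to keep acceptance and TG side by side is sketched below. Everything in it is an assumption for illustration: the `MtpRun` record, the field names, and the sample numbers are made up, and you would fill them in from whatever draft statistics and tokens-per-second figures your own runs report.

```python
from dataclasses import dataclass


@dataclass
class MtpRun:
    """One benchmark run with speculative drafting enabled (hypothetical record)."""
    label: str
    drafted: int    # draft tokens proposed
    accepted: int   # draft tokens the target model accepted
    tg_tps: float   # measured generation speed, tokens/s

    @property
    def acceptance(self) -> float:
        # Fraction of proposed draft tokens that survived verification.
        return self.accepted / self.drafted if self.drafted else 0.0


def summarize(baseline_tps: float, runs: list) -> list:
    """Return (label, acceptance, speedup vs. non-MTP baseline) per run."""
    return [(r.label, r.acceptance, r.tg_tps / baseline_tps) for r in runs]


if __name__ == "__main__":
    # Placeholder numbers only; these are NOT measurements from this guide.
    runs = [
        MtpRun("kv f16", drafted=1000, accepted=800, tg_tps=50.0),
        MtpRun("kv q8_0", drafted=1000, accepted=500, tg_tps=30.0),
    ]
    for label, acc, speedup in summarize(baseline_tps=25.0, runs=runs):
        print(f"{label}: acceptance={acc:.0%} speedup={speedup:.2f}x")
```

Reading the two columns together is the point: a config can post a decent TG number on a short prompt while its acceptance rate shows the drafts are mostly being thrown away.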
1.2 Safe Starting Profiles #
These are conservative baselines for a single-user server. They are starting points, not universal optimums; model architecture still changes the memory math.
Workload<br>--fit-target<br>Context<br>KV cache<br>--parallel<br>Batch
Text, 12 GB VRAM<br>512 MiB<br>64k<br>q8_0 / q8_0<br>1<br>1024
Text, 24 GB VRAM<br>512–768 MiB<br>128k<br>q8_0 / q8_0<br>1–2<br>1024
Vision, 12 GB VRAM<br>2048 MiB<br>64k<br>q8_0 / q8_0<br>1<br>256
MTP speculative decoding<br>512+ MiB<br>64k<br>f16 / f16<br>1<br>1024
Avoid the exciting failure modes: do not squeeze vision below --fit-target 2048 on a 12 GB card; do not enable GGML_CUDA_GRAPH_OPT=1 with less than 512 MiB headroom; do not include E-cores in a hybrid Intel CPU's thread range; and do not copy q8_0 KV settings into an MTP config without measuring draft acceptance. All four can look fine in a short benchmark and fail in a real session.
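One way to keep these baselines from drifting between model posts is to store them as data and generate the command line from them. The sketch below is an illustration, not the l3ms toolkit: the `PROFILES` dict, the profile names, and the model path are hypothetical. It assumes "64k" means 65536 tokens, that the Batch column maps to `-b`, that `--fit-target` takes a plain MiB number, and it takes the low end of the 24 GB ranges. `--parallel 1` follows the single-user advice in the TL;DR. The projector argument a vision model needs is left out.

```python
import shlex

# Baselines transcribed from the table above; the layout is this sketch's own.
PROFILES = {
    "text-12gb":   {"fit_target_mib": 512,  "ctx": 64 * 1024,  "kv": "q8_0", "parallel": 1, "batch": 1024},
    "text-24gb":   {"fit_target_mib": 512,  "ctx": 128 * 1024, "kv": "q8_0", "parallel": 1, "batch": 1024},
    "vision-12gb": {"fit_target_mib": 2048, "ctx": 64 * 1024,  "kv": "q8_0", "parallel": 1, "batch": 256},
    "mtp":         {"fit_target_mib": 512,  "ctx": 64 * 1024,  "kv": "f16",  "parallel": 1, "batch": 1024},
}


def build_command(profile: str, model_path: str) -> list:
    """Turn a named profile into an argv list for llama-server."""
    p = PROFILES[profile]
    return [
        "llama-server",
        "-m", model_path,
        "--fit", "on",
        "--fit-target", str(p["fit_target_mib"]),  # assumed to be in MiB
        "-c", str(p["ctx"]),
        "-ctk", p["kv"],   # K cache type
        "-ctv", p["kv"],   # V cache type
        "--parallel", str(p["parallel"]),
        "-b", str(p["batch"]),
    ]


if __name__ == "__main__":
    # Hypothetical model path; prints a shell-safe command line for review.
    print(shlex.join(build_command("vision-12gb", "/models/example.gguf")))
```

Printing the command instead of running it is deliberate: check the flags against `llama-server --help` for your build before launching, since option names and defaults change between releases.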
2. Optimization Priority Checklist #
Ordered by typical impact. Each item links to the section with the full explanation.
#<br>Action<br>Impact<br>Section
1<br>Enable XMP/EXPO in BIOS<br>2-3x TG on MoE<br>§6.1
2<br>Use MTP speculative drafting<br>2.0x-2.6x TG speedup<br>§18.1
3<br>Use QAT low-bit models (e.g. Q4 QAT)<br>Recovers much of the lost low-bit quality<br>§9.3
4<br>Run Linux or tune Windows power plan<br>~15-20% TPS<br>§7
5<br>Replace power-profiles-daemon with tuned-ppd<br>Eliminates intermittent 20-30% TG drop<br>§7.4
6<br>Build llama.cpp from source; keep updated<br>MoE kernel improvements per release<br>§8.2
7<br>Use --fit on for VRAM-optimal layer placement<br>Major TG; no manual tuning<br>§10.4
8<br>Use -ctk q8_0 -ctv q8_0 when not using MTP<br>Frees KV VRAM for extra GPU layers<br>§11.2
9<br>Keep KV cache at f16 for MTP unless tested otherwise<br>Preserves draft acceptance on tested Gemma 4 MTP configs<br>§18.2
10<br>Set --parallel 1 for single-user homelab<br>Reclaims KV VRAM for weights<br>§11.3
11<br>Pin...