I restarted a 10-year-old Xeon 174 times to delete twelve flags and gain four tokens a second - point.freeSearch
Published on June 15, 2026. 22-minute read.
A follow-up to running Gemma 4 on a 2016 Xeon. I took that 25-flag config apart one flag at a time to find which ones actually do the work, which are harmful, and which are just pitfalls. Most of my tuned config will do nothing for the typical user.

A couple of weeks ago I got Gemma 4, a 26-billion-parameter model, running at reading speed on a 2016 Xeon with no GPU and 128 GB of DDR3. That post spent about eight hours on the front page of Hacker News, which means a lot of people now have a 25-flag command sitting in a terminal somewhere, copied from a blog, with no real idea which of those flags is doing the work.

I have some bad news about that command. You’re likely holding it wrong.

What I said in passing in that post is that half of those flags will not take effect just by being present. Some need the right hardware. Some need the right host setup. Some only help on the right workload. The engine accepts all of them and tells you almost nothing about which ones actually fired. So this post is me going back and finding out what will actually work for YOU.

The way you find out is an ablation. You take the working config, switch off exactly one flag, measure what changes, put it back, and do the same for every flag in turn. The word is borrowed from neuroscience by way of machine learning, where it normally means knocking out a piece of the model itself, an attention head or a layer, to see what it was for. I am using it a little improperly here, for inference flags rather than model internals, but the idea carries over: turn one thing off, measure, repeat, and the differences tell you what each piece was worth.

It was a lot of work. Like a lot. One hundred and seventy-four runs, each one a fresh server reloading twenty-five gigabytes of weights off a spinning disk before it can answer a single token. Three prompts per launch, several repetitions each, and one entire overnight run I had to throw in the bin over a deadlock I will get to.
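The one-flag-at-a-time procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the harness used for these runs. The `BASELINE` dictionary is an illustrative subset of flag names mentioned in this post rather than the full 25-flag command, and `measure` is a hypothetical callable that is assumed to launch a fresh server with the given arguments and return decode speed in tokens per second for one run.

```python
from statistics import median

# Illustrative subset only, NOT the full published config.
# None marks a switch that takes no value. The "-t" value is a placeholder.
BASELINE = {
    "-t": "8",
    "--cpu-moe": None,
    "--merge-up-gate-experts": None,
    "--no-kv-offload": None,
}


def to_argv(config):
    """Flatten a {flag: value} mapping into a command-line argument list."""
    argv = []
    for flag, value in config.items():
        argv.append(flag)
        if value is not None:
            argv.append(value)
    return argv


def ablations(baseline):
    """Yield (removed_flag, config) pairs, each with exactly one flag dropped."""
    for flag in baseline:
        yield flag, {k: v for k, v in baseline.items() if k != flag}


def run_ablation(baseline, measure, reps=3):
    """Score the baseline, then every single-flag ablation.

    `measure(argv)` is a hypothetical callable returning tokens/sec for one run.
    Returns (baseline_tps, {flag: (tps, percent_change_vs_baseline)}).
    """
    def score(config):
        # Median over repetitions, as in the post's methodology.
        return median(measure(to_argv(config)) for _ in range(reps))

    base = score(baseline)
    report = {}
    for flag, config in ablations(baseline):
        tps = score(config)
        report[flag] = (tps, 100.0 * (tps - base) / base)
    return base, report
```

A negative percentage for a flag means removing it cost speed, so the flag was helping. Because every variant keeps all the other flags on, each delta is conditional on the rest of the config and the deltas should not be expected to sum.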
I am telling you the count because the count is the point. The reason nobody knows which of these flags matter is that finding out is slow and tedious, so almost nobody does it. I did, so here is the answer.

The setup

The box is the same one from the Xeon post. A Xeon E5-2620 v4, eight physical cores, sixteen threads, 128 GB of DDR3, no GPU, no swap. The engine is ik_llama.cpp on the feat/gemma-4-mtp branch. The verifier is gemma-4-26B-A4B-it at Q8_0, paired with its MTP drafter at Q8_0.

I’ll use 3 test prompts:

- a short chat turn
- a roughly five-thousand-token document to summarize
- a code generation request

Greedy decoding, fixed seed, 256 new tokens, three repetitions, median reported.

Two things to clear out of the way before the numbers.

I ran the benchmark under llama-server rather than the llama-cli from the original command. The server is where the per-request speculative-decoding telemetry lives, which is the only clean way to check whether the drafter actually fired on a given request, and it is what the upstream pull request drives its own benchmarks through. Nothing about the config changes, just the harness around it.

And every number below is the full config with one flag changed, so each delta is that flag’s contribution given everything else is still on, not its effect in a vacuum. Flags interact. Speculation changes how much the thread count matters. Repacking changes what flash attention has to read. The deltas do not add up to the gap between this config and a naive one, and I will not pretend they do.

The whole board

Here is every lever that moves the needle, in one place, before I walk through them. Decode speed, tokens per second, median of the repetitions, each row the full config with exactly one thing changed.
The percentage is against the published config, so a flag that helps shows up as a loss when you remove it.

| lever changed | chat (tok/s) | long doc (tok/s) | code (tok/s) |
|---|---|---|---|
| published config (autotune drafter) | 12.3 | 6.8 | 15.6 |
| drafter, fixed draft 1 | 15.6 (+27%) | 9.8 (+44%) | 15.9 (+2%) |
| drafter, fixed draft 2 | 16.1 (+31%) | 7.4 (+8%) | 17.8 (+14%) |
| drafter, fixed draft 3, autotune off | 13.9 (+13%) | 7.1 (+4%) | 18.3 (+17%) |
| drafter off entirely | 12.2 (wash) | 10.5 (+54%) | 12.2 (−22%) |
| flash attention off | 6.7 (−46%) | 4.3 (−37%) | 7.5 (−52%) |
| threads -t 4 | 7.9 (−35%) | 5.0 (−28%) | 9.5 (−39%) |
| threads -t 16 | 10.8 (−12%) | 7.1 (+3%) | 16.2 (+4%) |
| run-time-repack off | 11.8 (−4%) | 9.2 (+35%) | 13.7 (−12%) |

Everything else I tried, --mla-use, --cpu-moe, --merge-up-gate-experts, --no-kv-offload, the -sm graph cluster, lands within a few percent of the published config on chat and code, which is to say inside the noise. The long-document column is messier than the other two and I will come back to why in the repack section. Read the negatives as the interesting part. Most of these levers do nothing or cost you, and the work is...