Show HN: Llama CPU Benchmarks

muthuishere1 pts0 comments

I tried 4 LLM speedup techniques on CPU. Three made it slower.

Skip to content

Appearance

Gemma quietly won.I tried 4 LLM speedups on CPU. 3 made it slower.<br>94% tool-calling accuracy. 6.2 s p50. Single Xeon, no GPU. Stock llama.cpp + Gemma-4-E4B-it beat every clever trick I threw at it — TurboQuant, speculative decoding, ik_llama.cpp, the lot.<br>Read the article →<br>Engine bake-off<br>Raw results JSON

TurboQuant — "8× faster"<br>The headline is a synthetic GPU-kernel number. On real CPU end-to-end it ran 2.2× slower and dropped Qwen accuracy 17 pp. Memory savings real; speed wins conditional.

Speculative decoding<br>Published 2.5–3× CPU speedup is real — only on 7B+ targets. At 4B on a 4-core cgroup: 1.48× SLOWER. Draft + verify orchestration eats more than the drafts save.

ik_llama.cpp<br>1.53× faster end-to-end on Qwen via IQK matmul kernels. But parallel tool calls collapse 80% → 0% and Phi-4 will not even load. Hard veto for production.

Gemma-4-E4B-it quietly won<br>94.3% overall. 100% on multi-function AND parallel calls. 6.2 s p50. Beat Qwen 3.5 4B and Phi-4-mini-instruct outright. This is the ship recommendation.

Phi-4-mini ships broken<br>Drop-in `--jinja` tool-calling = 0.0% pass — llama.cpp falls back to a prose parser. A 30-character system prompt rescues 74%. Lose parallel calls anyway.

What to actually ship<br>Stock `ghcr.io/ggml-org/llama.cpp:full` + Gemma-4-E4B-it Q4_K_M + FP16 KV + `--jinja --reasoning off --reasoning-budget 0`. No fork. No quantized KV. No draft model.

In one paragraph ​<br>Three ~4B open-weight tool-calling models (Qwen 3.5 4B, Google Gemma-4-E4B-it, Microsoft Phi-4-mini), four CPU speedup techniques (TurboQuant KV quantization, speculative decoding, ik_llama.cpp, OpenVINO/vLLM as outside references), one shared Xeon E-2176G box, 35 BFCL tool-calling cases per cell, full cgroup isolation, sanitized public artifacts. Eleven cells of measured pain. The TL;DR is in the hero. The story is in the article. The data is on /results.<br>Where to go ​<br>The article — narrative version, ~6 min, the one you share.<br>Engine bake-off — deeper dive: stock vs specdec vs ik_llama.cpp, with numbers.<br>Results table — all 11 cells, sortable, no prose.<br>HTTP API — grab the JSON directly.<br>Spec: TurboQuant bake-off · Spec: Engine bake-off — methodology and why each measurement was chosen.<br>All MIT, all reproducible from a public repo.

gemma tool llama slower calling turboquant

Related Articles