Show HN: Llama CPU Benchmarks

I tried 4 LLM speedup techniques on CPU. Three made it slower.


Gemma quietly won. I tried 4 LLM speedups on CPU. 3 made it slower. 94% tool-calling accuracy. 6.2 s p50. Single Xeon, no GPU. Stock llama.cpp + Gemma-4-E4B-it beat every clever trick I threw at it: TurboQuant, speculative decoding, ik_llama.cpp, the lot.

TurboQuant: "8× faster"
The headline is a synthetic GPU-kernel number. In real end-to-end CPU runs it was 2.2× slower and cut Qwen accuracy by 17 pp. The memory savings are real; the speed wins are conditional.

Speculative decoding
The published 2.5–3× CPU speedup is real, but only on 7B+ targets. At 4B on a 4-core cgroup it ran 1.48× SLOWER: draft + verify orchestration costs more than the drafts save.
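One way to see why a small target can lose is the commonly used expected-speedup estimate for speculative decoding. The sketch below assumes each drafted token is accepted independently with probability `alpha`, that one draft step costs a fraction `c` of one target step, and that verifying all drafted tokens costs a single target step. That last assumption is optimistic on a CPU with few cores, so real results can be worse than this estimate. All inputs here are illustrative values, not measurements from this benchmark.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Estimated wall-clock speedup of speculative decoding over plain decoding.

    alpha: probability that a drafted token is accepted (0 <= alpha < 1)
    gamma: number of tokens drafted per verification step
    c:     cost of one draft step relative to one target step
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1 or c < 0.0:
        raise ValueError("gamma must be >= 1 and c must be >= 0")
    # Expected tokens produced per draft-then-verify round.
    tokens_per_round = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of one round, in units of target steps: gamma drafts plus one verify.
    cost_per_round = gamma * c + 1.0
    return tokens_per_round / cost_per_round


if __name__ == "__main__":
    # Illustrative: large target, tiny draft, high acceptance.
    print(round(expected_speedup(alpha=0.8, gamma=4, c=0.05), 2))
    # Illustrative: small target, so the draft is relatively expensive.
    print(round(expected_speedup(alpha=0.6, gamma=4, c=0.4), 2))
```

When the draft model is cheap relative to the target, the estimate is well above 1. When the target is itself small, `c` grows and the estimate can drop below 1, which matches the direction of the 4B result reported above.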

ik_llama.cpp
1.53× faster end-to-end on Qwen via IQK matmul kernels. But parallel tool calls collapse from 80% to 0%, and Phi-4 will not even load. Hard veto for production.

Gemma-4-E4B-it quietly won
94.3% overall. 100% on multi-function AND parallel calls. 6.2 s p50. Beat Qwen 3.5 4B and Phi-4-mini-instruct outright. This is the ship recommendation.
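For readers checking the arithmetic: with 35 BFCL cases per cell (as the summary on this page states), 94.3% is what 33 passes out of 35 would round to, and 74% is what 26 out of 35 would round to. The page does not give raw pass counts, so those counts are an assumption, as is the idea that the overall score is a plain unweighted pass fraction. A minimal sketch of both headline metrics, with made-up latencies:

```python
from statistics import median


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage of cases, unweighted."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    return 100.0 * passed / total


def p50(latencies_s: list[float]) -> float:
    """Median (50th percentile) latency in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    return median(latencies_s)


if __name__ == "__main__":
    print(round(pass_rate(33, 35), 1))   # assumed count: 33 of 35
    print(round(pass_rate(26, 35)))      # assumed count: 26 of 35
    print(p50([5.9, 6.2, 7.0]))          # hypothetical latency samples
```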

Phi-4-mini ships broken
Drop-in `--jinja` tool-calling = 0.0% pass: llama.cpp falls back to a prose parser. A 30-character system prompt rescues 74%. You lose parallel calls anyway.

What to actually ship
Stock `ghcr.io/ggml-org/llama.cpp:full` + Gemma-4-E4B-it Q4_K_M + FP16 KV + `--jinja --reasoning off --reasoning-budget 0`. No fork. No quantized KV. No draft model.
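As a rough illustration, the recommendation above could be assembled into a container launch command like this. The image name and the `--jinja --reasoning off --reasoning-budget 0` flags are taken from the text. Everything else is an assumption: the host models directory, the GGUF filename, the port, the use of the image's `--server` mode, and spelling out FP16 KV explicitly with `--cache-type-k` / `--cache-type-v` (f16 is the llama.cpp default, so those two flags may be redundant). Check the flags against your llama.cpp build before relying on them.

```python
import shlex

IMAGE = "ghcr.io/ggml-org/llama.cpp:full"  # image named in the article


def build_server_command(models_dir: str, model_file: str, port: int = 8080) -> list[str]:
    """Build a `docker run` argv for the stock llama.cpp server (hypothetical paths)."""
    return [
        "docker", "run", "--rm",
        "-p", f"{port}:{port}",
        "-v", f"{models_dir}:/models",
        IMAGE,
        "--server",                       # assumption: server mode of the full image
        "-m", f"/models/{model_file}",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--cache-type-k", "f16",          # FP16 KV, stated explicitly
        "--cache-type-v", "f16",
        "--jinja",                        # flags below are quoted from the article
        "--reasoning", "off",
        "--reasoning-budget", "0",
    ]


if __name__ == "__main__":
    # Hypothetical directory and filename, for illustration only.
    cmd = build_server_command("/srv/models", "gemma-4-E4B-it-Q4_K_M.gguf")
    print(shlex.join(cmd))
```

Building the argv as a list keeps quoting out of the picture; `shlex.join` is only used to print a copy-pasteable form.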

In one paragraph
Three ~4B open-weight tool-calling models (Qwen 3.5 4B, Google Gemma-4-E4B-it, Microsoft Phi-4-mini), four CPU speedup techniques (TurboQuant KV quantization, speculative decoding, ik_llama.cpp, and OpenVINO/vLLM as outside references), one shared Xeon E-2176G box, 35 BFCL tool-calling cases per cell, full cgroup isolation, sanitized public artifacts. Eleven cells of measured pain. The TL;DR is in the hero. The story is in the article. The data is on /results.

Where to go
The article: narrative version, ~6 min, the one you share.
Engine bake-off: deeper dive into stock vs specdec vs ik_llama.cpp, with numbers.
Results table: all 11 cells, sortable, no prose.
HTTP API: grab the JSON directly.
Specs (TurboQuant bake-off, Engine bake-off): methodology and why each measurement was chosen.

All MIT, all reproducible from a public repo.
