Apples® to Apples®: MLX vs llama.cpp for Gemma 4 12B - Ziraph blog
This is an engineering field report, structured like a paper for clarity - not a study. One M1, 16 GB, 3 runs per condition, dogfooding my own tool. No peer review, no claim of generality. The structure is for navigating the data, not a claim of rigor it doesn't have. Where a number is an upper bound or rests on a single machine, it says so.
The bottom line
Decode is a tie. At matched quant (~4.9 bpw on both sides), llama.cpp and mlx_lm generate at the same ~7-7.8 tok/s on this M1 - both pinned at 93-97% of the memory-bandwidth wall, where the engine can't move the number.
What differs is around decode, not decode. MLX's Python startup tax loses the one-shot short query (-22% wall clock) and it burns 2-4.5× the CPU energy throughout; GPU energy per token is a wash.
Pick by regime and chip, not "MLX is faster." llama.cpp for snappy / interactive, either for long batch - MLX's headline win is real on bandwidth-rich M-Max / Ultra silicon and evaporates on a bandwidth-starved M1 at matched precision.
Abstract
A fair fight between two local AI runtimes on the same workload and the same machine - Gemma 4 12B on a 16 GB M1 - at matched precision and under a stack of controls. The contestants are the lean mlx_lm runtime and raw llama.cpp (llama-cli), both running text-only at an effective ~4.9 bits per weight, both wrapped by ziraph and driven by one ziraph campaign file.
The going-in expectation was that MLX should win - that's what the literature sells on Apple Silicon. On this M1 it doesn't win, and it doesn't lose: decode is a tie. Short prompt or long, both runtimes generate tokens at the same ~7-7.8 tok/s, because both are pinned against the same memory-bandwidth wall (93-97% of M1's measured ceiling) and matched precision means matched bytes per token. The engine cannot change a bandwidth-bound number.
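The bandwidth argument can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes exactly 12e9 weights, each read once per generated token, and ignores KV-cache and activation traffic. Those are simplifications for illustration, not how ziraph derives its bus-saturation figure, so the estimate will not reproduce the 93-97% number exactly.

```python
# Rough roofline for bandwidth-bound decode.
# Assumptions (not measurements): 12e9 weights, each read once per token,
# no KV-cache or activation traffic.

PARAMS = 12e9            # assumed parameter count for a "12B" model
BITS_PER_WEIGHT = 4.9    # effective bits per weight quoted for both runtimes
BANDWIDTH_GBPS = 61.4    # measured M1 ceiling reported in the compare output


def decode_ceiling_tok_s(params: float, bits_per_weight: float, bandwidth_gbps: float) -> float:
    """Upper bound on tokens/s if every token must stream all weights once."""
    bytes_per_token = params * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token


ceiling = decode_ceiling_tok_s(PARAMS, BITS_PER_WEIGHT, BANDWIDTH_GBPS)
print(f"estimated ceiling: {ceiling:.2f} tok/s")

# Reported decode rates from the short-prompt comparison.
for name, measured in (("llama.cpp", 7.80), ("mlx_lm", 7.64)):
    print(f"{name}: {measured / ceiling:.0%} of the estimated ceiling")
```

Because both runtimes carry the same bytes per token, the ceiling is identical for both, which is why the engine choice cannot move a bandwidth-bound decode rate.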
What does differ is everything around decode, not decode itself, where GPU energy per token is a tie. There are two costs: a per-invocation startup tax (MLX's Python load is ~9 s against llama.cpp's ~3 s, enough to lose the one-shot short query) and a 2 to 4.5× CPU-energy bill from the Python orchestration layer. And the headline "MLX is faster on Apple Silicon" turns out to be M4-class and baseline-dependent: most of the MLX "decode edge" I'd seen in an earlier pass against Ollama was the daemon's own overhead, not the engine.
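The startup tax can be sketched as a toy model: a fixed cost paid once per invocation, followed by decode at a steady rate. The startup values are the approximate ~3 s and ~9 s quoted above, and the decode rates are the reported figures from the short-prompt comparison (7.80 and 7.64 tok/s). The output lengths are hypothetical, chosen only to show the trend, and prompt processing is ignored.

```python
# Toy model: wall-clock throughput with a fixed per-invocation startup cost.
# Output lengths are hypothetical; prompt processing time is ignored.

def wall_clock_tok_s(n_tokens: int, startup_s: float, decode_tok_s: float) -> float:
    """Tokens per second over the whole invocation, startup included."""
    return n_tokens / (startup_s + n_tokens / decode_tok_s)


RUNTIMES = {
    "llama.cpp": {"startup_s": 3.0, "decode_tok_s": 7.80},
    "mlx_lm": {"startup_s": 9.0, "decode_tok_s": 7.64},
}

for n_tokens in (170, 2000):
    rates = {name: wall_clock_tok_s(n_tokens, **cfg) for name, cfg in RUNTIMES.items()}
    gap = rates["mlx_lm"] / rates["llama.cpp"] - 1
    rounded = {name: round(rate, 2) for name, rate in rates.items()}
    print(f"{n_tokens} tokens: {rounded}, MLX gap {gap:+.0%}")
```

Under this model the fixed cost dominates a short one-shot query and is amortized away on long outputs, which fits the regime split: llama.cpp for interactive use, either runtime for long batch.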
The one-line story is that I wanted a fair "does MLX actually beat the Ollama/GGUF path for Gemma 4 on my Mac?" measurement, and getting to fair took three dead-ends and a 24 GB re-quantize I kept trying to avoid (§3.2). What follows is how the machine forced the answer out of me, one control at a time.
The result, up front - matched quant, short prompt, three runs a side, the full ziraph compare:
ziraph compare — Variant A (llama.cpp, Q4_K_M) vs Variant B (mlx_lm, mixed_4_6)
gemma-4-12B · short prompt · M1 16 GB · aggregate across 3 runs/side (± = across-run σ)

                             Variant A          Variant B          Δ         Δ%          Sig
energy_per_token     J/tok   1.16 ± 0.13        1.37 ± 0.01        +0.21     +18% (A)    ≈1σ
tokens_per_second    tok/s   6.92 ± 0.45        5.38 ± 0.02        -1.54     -22% (A)    SIG ⭐
reported decode      tok/s   7.80               7.64               -0.16     -2% (A)     —
non-decode overhead  %       11.3               29.5               +18.2     +161% (A)   —
─── Energy ───────────────────────────────────────────────────────────────
mean_ane             J       0.0000 ± 0.0000    0.0000 ± 0.0000    +0.0000   —           —
mean_gpu             J       188 ± 22           191 ± 1            +3.159    +2% (A)     ≈1σ
mean_cpu             J       9.4 ± 1.6          42 ± 2             +32.33    +344% (A)   SIG ⭐
duration             s       24.58              31.58              +6.999    +28% (A)    —
─── Average power ────────────────────────────────────────────────────────
mean_ane             mW      0.0000 ± 0.0000    0.0000 ± 0.0000    +0.0000   —           —
mean_gpu             W       7.11 ± 0.47        6.12 ± 0.19        -0.9903   -14% (B)    SIG ⭐
mean_cpu             W       0.6423 ± 0.0185    1.27 ± 0.11        +0.6294   +98% (A)    SIG ⭐
Media Engine         W       0.0257             0.0260             +0.0003   +1% (A)     —
peak Media Engine    W       0.0313             0.0307             -0.0006   -2% (B)     —
─── Bottleneck Hints ─────────────────────────────────────────────────────
Variant A: (no hints fired)
Variant B: (no hints fired)
Match — no hints fired on either side
Bandwidth (% of measured roofline):
  A 97% of M1 ceiling (61.4 GB/s, calibrated 2026-06-05) [bus-saturation/amc-sum]
  B 97% of M1 ceiling (61.4 GB/s, calibrated 2026-06-05) [bus-saturation/amc-sum]

ziraph compare output — short prompt, three runs a side, llama.cpp (Variant A) vs mlx_lm (Variant B).
Don't read it all yet. The story lives in three rows: reported decode is a tie (7.80 vs 7.64), the wall-clock tokens_per_second goes to llama.cpp (MLX is 22% lower, all of it the MLX startup tax in the non-decode overhead row), and mean_cpu J is +344% - MLX's real, consistent cost. By the end you'll read every cell at a glance, including the long-output table where the totals look like an MLX win and the per-token rows say otherwise.
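Two of the derived rows can be reproduced from the other cells, which helps when reading the table. The formulas below are a reconstruction from the printed numbers, not confirmed ziraph internals: non-decode overhead is assumed to be the share of wall-clock throughput lost relative to reported decode, and energy per token is assumed to be GPU plus CPU joules divided by tokens generated (throughput × duration). Small mismatches come from the inputs being rounded.

```python
# Reconstructing derived rows of the compare table from its other cells.
# The formulas are assumptions inferred from the printed values.

def non_decode_overhead_pct(wall_tok_s: float, decode_tok_s: float) -> float:
    """Percent of wall-clock throughput lost to everything that is not decode."""
    return (1 - wall_tok_s / decode_tok_s) * 100


def energy_per_token(gpu_j: float, cpu_j: float, wall_tok_s: float, duration_s: float) -> float:
    """Joules per generated token, counting GPU and CPU energy only."""
    tokens = wall_tok_s * duration_s
    return (gpu_j + cpu_j) / tokens


VARIANTS = {
    "A (llama.cpp)": {"wall": 6.92, "decode": 7.80, "gpu_j": 188, "cpu_j": 9.4, "dur": 24.58},
    "B (mlx_lm)": {"wall": 5.38, "decode": 7.64, "gpu_j": 191, "cpu_j": 42, "dur": 31.58},
}

for name, v in VARIANTS.items():
    overhead = non_decode_overhead_pct(v["wall"], v["decode"])
    joules = energy_per_token(v["gpu_j"], v["cpu_j"], v["wall"], v["dur"])
    print(f"{name}: overhead {overhead:.1f}%, energy {joules:.2f} J/tok")
```

Read this way, both sides generate roughly the same number of tokens, so the +18% energy per token on the MLX side is almost entirely the extra CPU joules rather than anything in the GPU row.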
1. Motivation & hypothesis
I wanted two things the day after Gemma 4 12B landed (2026-06-03, announced as a...