Apples to Apples: MLX vs. llama.cpp for Gemma 4 12B on an M1 16GB

Apples® to Apples®: MLX vs llama.cpp for Gemma 4 12B - Ziraph blog


This is an engineering field report, structured like a paper for clarity - not a study. One M1, 16 GB, 3 runs per condition, dogfooding my own tool. No peer review, no claim of generality. The structure is for navigating the data, not a claim of rigor it doesn't have. Where a number is an upper bound or rests on a single machine, it says so.

The bottom line

Decode is a tie. At matched quant (~4.9 bpw on both sides), llama.cpp and mlx_lm generate at the same ~7-7.8 tok/s on this M1 - both pinned at 93-97% of the memory-bandwidth wall, where the engine can't move the number.

What differs is everything around decode, not decode itself. MLX's Python startup tax loses the one-shot short query (-22% wall clock), and MLX burns 2-4.5× the CPU energy throughout; GPU energy per token is a wash.

Pick by regime and chip, not "MLX is faster." llama.cpp for snappy / interactive, either for long batch - MLX's headline win is real on bandwidth-rich M-Max / Ultra silicon and evaporates on a bandwidth-starved M1 at matched precision.

Abstract

A fair fight between two local AI runtimes on the same workload and the same machine - Gemma 4 12B on a 16 GB M1 - at matched precision and under a stack of controls. The contestants are the lean mlx_lm runtime and raw llama.cpp (llama-cli), both running text-only at an effective ~4.9 bits per weight, both wrapped by ziraph and driven by one ziraph campaign file.

The going-in expectation was that MLX should win - that's what the literature sells on Apple Silicon. On this M1 it doesn't win, and it doesn't lose: decode is a tie. Short prompt or long, both runtimes generate tokens at the same ~7-7.8 tok/s, because both are pinned against the same memory-bandwidth wall (93-97% of M1's measured ceiling) and matched precision means matched bytes per token. The engine cannot change a bandwidth-bound number.
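The bandwidth argument can be checked on the back of an envelope. The sketch below is a simplification, not how ziraph measures anything: it assumes a nominal 12e9 parameters (taken from the "12B" label, not a counted figure), assumes every weight is read exactly once per generated token, and ignores KV-cache and activation traffic. The 61.4 GB/s ceiling and the two decode rates are the figures quoted in this post.

```python
# Back-of-envelope roofline for bandwidth-bound decode.
# Assumptions (not from the measurements): nominal 12e9 parameters, each weight
# read once per generated token, KV-cache and activation traffic ignored.

PARAMS = 12e9            # assumption: nominal "12B" parameter count
BITS_PER_WEIGHT = 4.9    # effective bits per weight quoted in the post
CEILING_GBPS = 61.4      # measured M1 ceiling quoted in the compare output


def decode_ceiling_tps(params: float, bpw: float, bandwidth_gbps: float) -> float:
    """Upper bound on tokens/s when each token must stream all weights once."""
    bytes_per_token = params * bpw / 8
    return bandwidth_gbps * 1e9 / bytes_per_token


ceiling = decode_ceiling_tps(PARAMS, BITS_PER_WEIGHT, CEILING_GBPS)

for name, tps in [("llama.cpp", 7.80), ("mlx_lm", 7.64)]:
    print(f"{name}: {tps:.2f} tok/s is {tps / ceiling:.0%} of a {ceiling:.2f} tok/s ceiling")
```

Under these assumptions both runtimes land in the low 90s percent of the estimated ceiling, the same neighborhood as the measured 93-97%. The estimate is only a sanity check on the reasoning; the measured figure comes from bus counters, not from this arithmetic.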

What does differ is everything around decode, not decode itself, where GPU energy per token is a tie. There is a per-invocation startup tax (MLX's Python load is ~9 s against llama.cpp's ~3 s, enough to lose the one-shot short query) and a 2 to 4.5× CPU-energy bill from the Python orchestration layer. And the headline "MLX is faster on Apple Silicon" turns out to be M4-class and baseline-dependent: most of the MLX "decode edge" I'd seen in an earlier pass against Ollama was the daemon's own overhead, not the engine.
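Why a fixed startup tax decides the short query but not the long batch can be seen with a toy model: wall-clock throughput is tokens divided by startup time plus decode time. This is an illustrative sketch, not ziraph's accounting. The startup values are the rounded ~3 s and ~9 s from the text, the decode rates are the reported 7.80 and 7.64 tok/s, and the token counts (170, 1000, 4000) are assumptions chosen to show the trend.

```python
# Toy amortization model: a fixed per-invocation startup cost spread over N tokens.
# Assumptions: startup is constant, decode rate is constant, nothing else costs time.

RUNTIMES = {
    "llama.cpp": (3.0, 7.80),  # (startup seconds, decode tok/s), rounded from the post
    "mlx_lm": (9.0, 7.64),
}


def wall_clock_tps(n_tokens: int, startup_s: float, decode_tps: float) -> float:
    """Tokens per second as seen by the user, startup included."""
    return n_tokens / (startup_s + n_tokens / decode_tps)


def gap(n_tokens: int) -> float:
    """Relative wall-clock throughput of mlx_lm versus llama.cpp (negative = slower)."""
    a = wall_clock_tps(n_tokens, *RUNTIMES["llama.cpp"])
    b = wall_clock_tps(n_tokens, *RUNTIMES["mlx_lm"])
    return (b - a) / a


for n in (170, 1000, 4000):  # hypothetical output lengths
    print(f"{n:>5} tokens: mlx_lm vs llama.cpp {gap(n):+.0%}")
```

In this model the gap is large at a couple of hundred tokens and shrinks toward the small difference in decode rate as the output grows, which is the "llama.cpp for snappy, either for long batch" rule in arithmetic form.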

The one-line story is that I wanted a fair "does MLX actually beat the Ollama/GGUF path for Gemma 4 on my Mac?" measurement, and getting to fair took three dead-ends and a 24 GB re-quantize I kept trying to avoid (§3.2). What follows is how the machine forced the answer out of me, one control at a time.

The result, up front - matched quant, short prompt, three runs a side, the full ziraph compare:

ziraph compare — Variant A (llama.cpp, Q4_K_M) vs Variant B (mlx_lm, mixed_4_6)
gemma-4-12B · short prompt · M1 16 GB · aggregate across 3 runs/side (± = across-run σ)

                               Variant A          Variant B          Δ          Δ%          Sig
energy_per_token     J/tok     1.16 ± 0.13        1.37 ± 0.01        +0.21      +18% (A)    ≈1σ
tokens_per_second    tok/s     6.92 ± 0.45        5.38 ± 0.02        -1.54      -22% (A)    SIG ⭐
reported decode      tok/s     7.80               7.64               -0.16      -2% (A)     —
non-decode overhead  %         11.3               29.5               +18.2      +161% (A)   —
─── Energy ───────────────────────────────────────────────────────────────
mean_ane             J         0.0000 ± 0.0000    0.0000 ± 0.0000    +0.0000    —           —
mean_gpu             J         188 ± 22           191 ± 1            +3.159     +2% (A)     ≈1σ
mean_cpu             J         9.4 ± 1.6          42 ± 2             +32.33     +344% (A)   SIG ⭐
duration             s         24.58              31.58              +6.999     +28% (A)    —
─── Average power ────────────────────────────────────────────────────────
mean_ane             mW        0.0000 ± 0.0000    0.0000 ± 0.0000    +0.0000    —           —
mean_gpu             W         7.11 ± 0.47        6.12 ± 0.19        -0.9903    -14% (B)    SIG ⭐
mean_cpu             W         0.6423 ± 0.0185    1.27 ± 0.11        +0.6294    +98% (A)    SIG ⭐
Media Engine         W         0.0257             0.0260             +0.0003    +1% (A)     —
peak Media Engine    W         0.0313             0.0307             -0.0006    -2% (B)     —
─── Bottleneck Hints ─────────────────────────────────────────────────────
Variant A: (no hints fired)
Variant B: (no hints fired)
Match — no hints fired on either side
Bandwidth (% of measured roofline):
  A  97% of M1 ceiling (61.4 GB/s, calibrated 2026-06-05) [bus-saturation/amc-sum]
  B  97% of M1 ceiling (61.4 GB/s, calibrated 2026-06-05) [bus-saturation/amc-sum]

ziraph compare output: short prompt, three runs a side, llama.cpp (Variant A) vs mlx_lm (Variant B).

Don't read it all yet. The story lives in three rows: reported decode is a tie (7.80 vs 7.64); the wall-clock tokens_per_second goes to llama.cpp (MLX is 22% slower, all of it the startup tax in the non-decode overhead row); and mean_cpu J is +344%, MLX's real, consistent cost. By the end you'll read every cell at a glance, including the long-output table where the totals look like an MLX win and the per-token rows say otherwise.
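The rows of the compare output hang together arithmetically, and a few lines are enough to check that. The sketch below assumes definitions that are not confirmed by ziraph's documentation: that tokens generated is roughly tokens_per_second × duration, that energy_per_token is (GPU J + CPU J) / tokens, and that non-decode overhead is the share of wall-clock time not spent decoding. The input numbers are copied from the table.

```python
# Cross-check of the compare table under assumed definitions (see lead-in).

RUNS = {
    "A (llama.cpp)": dict(tps=6.92, duration_s=24.58, gpu_j=188.0, cpu_j=9.4,
                          decode_tps=7.80, overhead=0.113),
    "B (mlx_lm)": dict(tps=5.38, duration_s=31.58, gpu_j=191.0, cpu_j=42.0,
                       decode_tps=7.64, overhead=0.295),
}


def derive(run: dict) -> dict:
    """Recompute derived rows from the raw ones, using the assumed definitions."""
    tokens = run["tps"] * run["duration_s"]
    return {
        "tokens": tokens,
        "j_per_token": (run["gpu_j"] + run["cpu_j"]) / tokens,
        "wall_tps": run["decode_tps"] * (1 - run["overhead"]),
        "non_decode_s": run["duration_s"] * run["overhead"],
    }


derived = {name: derive(run) for name, run in RUNS.items()}

for name, d in derived.items():
    print(f"{name}: ~{d['tokens']:.0f} tokens, {d['j_per_token']:.2f} J/tok, "
          f"{d['wall_tps']:.2f} tok/s wall clock, {d['non_decode_s']:.1f} s outside decode")
```

With these definitions the recomputed J/tok and wall-clock tok/s agree with the table to within rounding, and the time outside decode comes out near the ~3 s and ~9 s startup figures quoted earlier.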

1. Motivation & hypothesis

I wanted two things the day after Gemma 4 12B landed (2026-06-03, announced as a...
