How to Setup a Local Coding Agent on macOS - Kyle Howells
-->
-->
-->
I'd had my internet fail a few times recently leaving me stranded without a coding agent, and so when I saw the "Gemma 4 now runs 2x faster with MTP" Multi-Token Prediction update for Gemma 4 I decided to have a go at getting it running.
I wanted a local coding agent setup that:
was fast enough to actually use on my Mac
worked through an OpenAI compatible API (so I could use it in other tools)
and preferably could handle screenshots/images when needed, so I can feed it screenshots of what it has made.
And I did! This video is realtime. And shows the agent responding at a perfectly usable speed.
After a bit of testing the final setup I ended up with is:
llama.cpp built with Metal on macOS
Gemma 4 26B-A4B in GGUF format
A Q8 MTP draft model for speculative decoding
The Gemma 4 multimodal projector
Pi as the terminal coding agent
This was tested on an Apple M1 Max with 64 GB unified memory, running macOS 15.7.7.
The Model
The main model is: gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf.
Link on Huggingface: models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
That file is about 16 GB. With the MTP draft head and multimodal projector the model folder is about 17 GB.
The benchmark prompt was:
Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
Each benchmark generated about 128 tokens.
Baseline: llama.cpp + Metal
First I ran the main model directly through llama.cpp with Metal acceleration:
repos/llama.cpp/build/bin/llama-cli \<br>-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \<br>-ngl 999 \<br>-fa on \<br>-c 4096 \<br>-n 128
Result:
Setup<br>Prompt tok/s<br>Generation tok/s
Gemma 4 26B-A4B Q4, llama.cpp Metal<br>298.0<br>58.2
58 tokens/second is not fast, but is usable, but for coding-agent work you want it to be as fast as possible, especially when the agent is making many tool calls.
Adding the MTP Draft Model
Gemma 4 now has the MTP draft model available:
MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf
This can be loaded by llama.cpp as a speculative draft model:
repos/llama.cpp/build/bin/llama-cli \<br>-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \<br>--model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \<br>--spec-type draft-mtp \<br>--spec-draft-n-max 3 \<br>-ngl 999 \<br>-fa on \<br>-c 4096 \<br>-n 128
The first run with MTP came in at 69.2 tokens/second using 4 draft tokens. However, Unsloth's guide on How to Run MTP Models includes this note:
"We found --spec-draft-n-max 2 is the best starting point however, do not assume 2 is optimal, as performance is hardware-dependent. Try any value from 1 through 6 and use whichever is fastest for your system."
After sweeping --spec-draft-n-max, the best result was 72.2 tokens/second with 3 draft tokens.
Setup<br>Prompt tok/s<br>Generation tok/s<br>Speedup
Main model only<br>298.0<br>58.2<br>1.00x
Main model + Q8 MTP draft<br>295.6<br>72.2<br>1.24x
The useful part is that prompt processing stayed basically the same, while generation improved by about 24%.
Tuning MTP
I tested --spec-draft-n-max values from 1 to 6.
--spec-draft-n-max<br>Prompt tok/s<br>Generation tok/s
295.5<br>68.4
299.1<br>72.0
295.6<br>72.2
297.3<br>70.7
297.9<br>63.7
296.3<br>61.2
On my M1 Max machine, 3 was the fastest, with 2 close enough that either would be fine. Values above that got slower.
MLX Comparison
I also tested MLX models through mlx-lm, to find out which is the faster way to run the model on a Mac, llama.cpp or mlx.
Runtime<br>Model<br>Generation tok/s
llama.cpp Metal + MTP<br>Unsloth GGUF Q4 + Q8 MTP<br>72.2
llama.cpp Metal<br>Unsloth GGUF Q4<br>58.2
MLX-LM<br>Unsloth UD MLX 4-bit<br>45.8
MLX-LM<br>mlx-community 4-bit<br>43.9
MLX-LM<br>mlx-community OptiQ 4-bit<br>38.1
I thought MLX (being optimised for the Mac) would be fastest.
However, for this specific setup, llama.cpp was faster than MLX, and llama.cpp with MTP was clearly the best option.
I guess all the effort and tweaking which has gone into llama.cpp over time means it quite well optimised fr macOS despite being cross platform.
I also tried Gemma 4 MTP through gemma-4-swift-mlx, but the tested 26B 4-bit MLX checkpoints did not match the loader's expected weight keys, and I already had the previous MLX tests, so moved on rather than redownload new models and try to tweak things to match.
Adding Image Support
For Pi, I also wanted to be able to attach screenshots. The local model entry I setup for it originally declared the model as text-only:
"input": ["text"]
That meant Pi did not send image tool output through to the model properly.
The llama.cpp server also needs the Gemma 4 multimodal projector in order for the multi-modal part to work (only the 12B is natively multi-modal):
mmproj-BF16.gguf
When loaded with --mmproj, llama.cpp advertises multimodal support, and Pi can send images.
I re-ran the text benchmark with the projector loaded, just to...