Running Claude Code and Pi on DeepSeek V4 Flash — locally on a 128GB MacBook Pro
← cd .. A 284-billion-parameter frontier model, running entirely offline on a laptop — and wired up as a backend for two agent harnesses: Claude Code and Pi.
DeepSeek V4 Flash dropped in April 2026: a 284B-parameter Mixture-of-Experts model (13B active per token), MIT-licensed, with a 1M-token context window. The interesting part for me wasn’t the benchmarks — it was the claim, floating around the internet, that you could run it locally on an Apple Silicon Mac with enough RAM.
I have a MacBook Pro with an M3 Max and 128GB of unified memory. So I tried it. Here’s everything that worked, everything that didn’t, and the scripts I ended up with.
TL;DR
It works. ~21 tokens/sec generation, fully on the Metal GPU, ~81GB resident.
You cannot use mainline llama.cpp or Ollama yet — the deepseek4 architecture isn’t merged. You need antirez’s experimental fork .
The model file is an 81GB 2-bit “Dwarf Star” quant from antirez/deepseek-v4-gguf, purpose-built for 128GB Macs.
llama-server now speaks the Anthropic Messages API natively , so you can point Claude Code at it with zero proxies.
1M context loads but crashes at inference; 256k is the reliable ceiling on this fork.
The hardware
Chip: Apple M3 Max (12 performance + 4 efficiency cores)<br>Memory: 128 GB unified<br>The 128GB is the whole ballgame. The 2-bit quant needs ~81GB resident, which means a 64GB machine is out — you’d swap to death or OOM. 128GB is the sweet spot the quant was designed around. (There’s a bigger Q4 variant at 153GB for the 192GB Mac Studios, and DeepSeek-V4-Pro quants too, but Flash-q2 is the one that fits a laptop.)
False start: the guide that didn’t work
I started from a tutorial that told me to git clone mainline llama.cpp, build it, and huggingface-cli download /deepseek-v4-flash. Two problems:
Mainline llama.cpp doesn’t support DeepSeek V4. The deepseek4 architecture — with its sparse attention, hyper-connections, and multi-token-prediction head — isn’t in stable releases. Ollama doesn’t support it either (it’ll auto-update once the arch merges upstream, but that hadn’t happened).
The download command had a literal placeholder for the repo. There was no real source behind it.
So if you find a tutorial telling you to use stock llama.cpp or ollama pull deepseek-v4, close the tab. As of mid-2026 that path does not exist.
What actually works: antirez’s fork
Salvatore “antirez” Sanfilippo (creator of Redis) maintains an experimental llama.cpp fork that implements the deepseek4 architecture, plus a HuggingFace repo of matching GGUF quants. The key file:
DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf (81 GB)<br>That filename is a recipe. It’s IQ2_XXS (2-bit) for the routed experts — which is where almost all 284B parameters live — but keeps the attention projections, shared experts, and output layer at Q8 . The parts that matter for coherence stay high-precision; the giant sparse expert tables get crushed to 2 bits. antirez calls it the “Dwarf Star” quant. His own note: “behaves very very well in the chat, frontier-model vibes, but it was not extensively tested.” That matches my experience.
Building it is standard llama.cpp:
git clone --depth 1 https://github.com/antirez/llama.cpp-deepseek-v4-flash llama.cpp<br>cd llama.cpp<br>cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release<br>cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)<br>This gives you llama-cli, llama-server, and llama-completion. The build detected my M3 Max GPU correctly:
ggml_metal_device_init: GPU name: MTL0 (Apple M3 Max)<br>ggml_metal_device_init: has unified memory = true<br>ggml_metal_device_init: recommendedMaxWorkingSetSize = 115448.73 MB<br>That ~115GB working-set ceiling is the number to keep in mind: the model eats ~83GB of it, leaving ~32GB for context and compute buffers.
Things that nearly fooled me
”It’s running on the CPU!” (it wasn’t)
My first test generation seemed to hang. top showed the process pegged at 99% on a single core for 19 minutes with no output. I was convinced the custom DeepSeek ops (the sparse-attention “indexer”, the “compressor”) had no Metal kernels and were falling back to CPU.
They weren’t. Two things were happening:
I’d piped the output through tail, which buffers until the process exits — so I saw nothing while it generated fine.
The 99%-single-core is just the orchestration thread spinning while the GPU does the matmuls. The real proof came from the memory breakdown:
| memory breakdown [MiB] | total free self ... |<br>| MTL0 (Apple M3 Max) | 110100 = 26265 + (83161 ...) |<br>83GB sitting on MTL0 — the Metal GPU. It was on the GPU the whole time. Lesson: don’t pipe a streaming LLM through tail , and check the memory breakdown before blaming the CPU.
Speed and load time
Generation: ~21 tok/s. Prompt eval: ~32–43 tok/s.
Cold load: ~9 minutes (reading 81GB off disk). Warm load: ~4 seconds once the file is in the OS page...