6. Air-Gapped Claude Code - The Claude Code SRE Handbook
06 — Air-Gapped Claude Code
The setup, the fixes that make it work, and the hardware that sets the pace
Claude Code connects to a model running locally on the laptop. You hand it a Kubernetes incident to investigate. Ten minutes later, Claude Code times out without producing a result. The model never called a tool; it spent the entire session thinking.
It loads. It doesn't work yet.
Four fixes later, the same laptop took an incident from investigation to an open pull request — found the root cause, wrote the patch, pushed a branch, filed the PR with gh — with nothing leaving the machine. It took its time. But it closed the loop. The gap between loads and works is those four fixes, and once you clear them, the thing that separates a 34-minute session from a fast one is hardware, not approach.
What follows: a step-by-step setup, the four fixes, and an account of how your hardware sets the speed.
Why local at all
One reason matters: data can't cross the firewall. In regulated environments and air-gapped clusters, a local harness isn't a preference—it's required. The good news: it works.
Everyone else is reading for the trade. Local buys you privacy and a flat cost. It bills you in latency and in a model smaller than the frontier. But as the completed loop above shows, capability is not what you give up on a task like this. Speed is. Whether that trade fits is what Part 2 answers across three posts. This one establishes what it takes to run, and what your hardware decides.
The stack
State the rig. All numbers depend on it.
Hardware: Apple M3 Pro, 18 GPU cores, 36 GiB unified memory, ~150 GB/s memory bandwidth.
Model: qwen3.6:35b-a3b-coding-nvfp4 — 35.1B parameters, mixture-of-experts with ~3B active per token, NVFP4 quantization. 21 GB on disk, ~20 GiB resident once loaded.
Runtime: Ollama 0.24.0, MLX runner (the Apple Silicon-native backend, not the llama.cpp/Metal path).
Client: Claude Code v2.1.84, pointed at the local Ollama endpoint.
MoE is what lets a 35B model run locally. Only ~3B parameters are active per token, so costs resemble those of a 14B dense model, while answer quality approaches that of a 35B one. A dense 35B doesn't fit in 36 GiB.
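The sizing above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes NVFP4 costs roughly 4.5 bits per weight (a 4-bit value plus one 8-bit scale shared by 16 weights) and that decoding is limited purely by memory bandwidth. Both are simplifying assumptions, not measurements from this rig, and the function names are hypothetical.

```shell
#!/bin/sh
# Back-of-envelope sizing. Constants come from the stack list above,
# except BITS_PER_WEIGHT, which is a labeled assumption.
TOTAL_PARAMS=35.1e9    # total parameters
ACTIVE_PARAMS=3e9      # ~3B active per token (MoE)
BITS_PER_WEIGHT=4.5    # ASSUMPTION: 4-bit value + 8-bit scale per 16 weights
BANDWIDTH=150e9        # ~150 GB/s, expressed in bytes per second

# weights_gb PARAMS BITS: approximate weight size in GB (10^9 bytes).
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 / 1e9 }'
}

# decode_ceiling PARAMS BITS BANDWIDTH: upper bound on tokens/s if every
# generated token has to read PARAMS weights from memory exactly once.
decode_ceiling() {
  awk -v p="$1" -v b="$2" -v bw="$3" 'BEGIN { printf "%.0f\n", bw / (p * b / 8) }'
}

echo "all weights:        ~$(weights_gb "$TOTAL_PARAMS" "$BITS_PER_WEIGHT") GB"
echo "ceiling, 3B active: ~$(decode_ceiling "$ACTIVE_PARAMS" "$BITS_PER_WEIGHT" "$BANDWIDTH") tokens/s"
echo "ceiling, all 35.1B: ~$(decode_ceiling "$TOTAL_PARAMS" "$BITS_PER_WEIGHT" "$BANDWIDTH") tokens/s"
```

Under these assumptions the weights come to about 19.7 GB, in the same range as the 21 GB on disk, and reading only the active experts raises the theoretical ceiling by roughly an order of magnitude over reading every weight. These are ceilings only: KV-cache reads, prefill, and runtime overhead all push real throughput lower.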
The tuned environment:
| Variable | Value | Why |
| --- | --- | --- |
| OLLAMA_MLX | 1 | Use the Apple Silicon MLX runner, not the llama.cpp/Metal backend |
| OLLAMA_CONTEXT_LENGTH | 32768 | What 36 GiB allows; more memory raises this (see below) |
| OLLAMA_FLASH_ATTENTION | 1 | Lower attention memory |
| OLLAMA_MULTIUSER_CACHE | 1 | Reuse the prefix cache across requests |
| OLLAMA_KEEP_ALIVE | 24h | Keep the 20 GiB model resident; reloads are slow |
From zero to a working session
Start to finish. This assumes Apple Silicon and a kubectl context pointed at your cluster.
1. Install Ollama, confirm the version.
ollama --version # must be 0.24.0 or newer — see fix #2 below
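If you would rather script the version gate than eyeball it, a comparison along these lines works. It is a minimal sketch: `version_ge` is a hypothetical helper, and the code assumes the first `x.y.z` string printed by `ollama --version` is the installed version.

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A >= B, compared numerically
# field by field. Missing fields count as 0, so 1.0 >= 0.24.0 holds.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = x[i] + 0; yi = y[i] + 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Only query the binary when it is installed, so the sketch is safe to paste.
if command -v ollama >/dev/null 2>&1; then
  # ASSUMPTION: the first x.y.z in the output is the installed version.
  installed=$(ollama --version 2>/dev/null | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
  if version_ge "${installed:-0}" "0.24.0"; then
    echo "ollama ${installed}: ok"
  else
    echo "ollama ${installed:-unknown}: upgrade to 0.24.0 or newer" >&2
  fi
fi
```

A plain string comparison would rank 0.9.0 above 0.24.0, which is why the fields are compared as numbers.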
2. Pull the model. One-time, ~21 GB.
ollama pull qwen3.6:35b-a3b-coding-nvfp4
3. Serve with the tuned environment. Leave it running.
OLLAMA_MLX=1 \
OLLAMA_CONTEXT_LENGTH=32768 \
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_MULTIUSER_CACHE=1 \
OLLAMA_KEEP_ALIVE=24h \
OLLAMA_NO_CLOUD=1 \
ollama serve
To make this permanent, set these variables in a launchd plist so that Ollama runs as a service rather than inside your terminal session.
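One way to produce such a plist is to generate it from the same variables used in step 3. The following is a sketch, not the author's configuration: the label `com.example.ollama`, the function name `ollama_plist`, and the binary path you pass in are all placeholders to adjust for your machine.

```shell
#!/bin/sh
# ollama_plist BIN: print a launchd agent definition that runs `BIN serve`
# with the tuned environment from step 3.
# ASSUMPTION: the label com.example.ollama is a placeholder.
ollama_plist() {
  bin="$1"
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.ollama</string>
  <key>ProgramArguments</key>
  <array><string>${bin}</string><string>serve</string></array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>OLLAMA_MLX</key><string>1</string>
    <key>OLLAMA_CONTEXT_LENGTH</key><string>32768</string>
    <key>OLLAMA_FLASH_ATTENTION</key><string>1</string>
    <key>OLLAMA_MULTIUSER_CACHE</key><string>1</string>
    <key>OLLAMA_KEEP_ALIVE</key><string>24h</string>
    <key>OLLAMA_NO_CLOUD</key><string>1</string>
  </dict>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
</dict>
</plist>
EOF
}

# Usage (not run here):
#   ollama_plist "$(command -v ollama)" > ~/Library/LaunchAgents/com.example.ollama.plist
#   launchctl load ~/Library/LaunchAgents/com.example.ollama.plist
```

`RunAtLoad` starts the server at login and `KeepAlive` restarts it if it exits. Stop any `ollama serve` running in a terminal first, since both would try to bind the same port.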
4. Point Claude Code at the local model and launch from your working directory:
ANTHROPIC_BASE_URL=http://localhost:11434 \
MAX_THINKING_TOKENS=0 \
claude --model qwen3.6:35b-a3b-coding-nvfp4
Leave ANTHROPIC_API_KEY out of the environment. Its absence is what makes Claude Code use the local model endpoint instead of contacting Anthropic's cloud service.
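A stray key exported from a shell profile is easy to miss, so a small guard before launching can help. This is a sketch with a hypothetical function name, `preflight`; it checks only the two variables discussed in this step.

```shell
#!/bin/sh
# preflight: succeed only when the environment matches step 4, that is,
# no ANTHROPIC_API_KEY and a base URL on the local machine.
preflight() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is set; unset it before launching" >&2
    return 1
  fi
  case "${ANTHROPIC_BASE_URL:-}" in
    http://localhost:*|http://127.0.0.1:*)
      return 0 ;;
    *)
      echo "ANTHROPIC_BASE_URL does not point at localhost" >&2
      return 1 ;;
  esac
}

# Usage (not run here):
#   export ANTHROPIC_BASE_URL=http://localhost:11434
#   preflight && MAX_THINKING_TOKENS=0 claude --model qwen3.6:35b-a3b-coding-nvfp4
```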
5. First prompt — a smoke test, not the main event. Pick something trivial that forces exactly one tool call:
Run kubectl get pods -A and tell me if anything appears unhealthy.
What you'll see: with thinking disabled, the first tool call arrives within a few seconds. You may then wait about 60 seconds while the model performs prefill, meaning it processes the full input (prompt plus context, about 25,000 tokens in this setup) before it can generate a reply. Then you get the answer. Subsequent turns are faster because the prefix cache keeps the static parts of the prompt, so they are not reprocessed. The burst of 404 errors in the Ollama log during this process is normal (addressed in fix #4).
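The two figures above imply a prefill rate, which is useful for estimating waits at other prompt sizes. The sketch below assumes prefill time scales linearly with token count; attention cost grows faster than that, so treat the projection as a lower bound. The function names are hypothetical.

```shell
#!/bin/sh
# prefill_rate TOKENS SECONDS: tokens processed per second, rounded.
prefill_rate() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.0f\n", t / s }'
}

# prefill_seconds TOKENS RATE: estimated wait in seconds for a cold prompt,
# ASSUMING time scales linearly with token count.
prefill_seconds() {
  awk -v t="$1" -v r="$2" 'BEGIN { printf "%.0f\n", t / r }'
}

rate=$(prefill_rate 25000 60)   # ~25,000 tokens in ~60 s, from the text above
echo "prefill rate: ~${rate} tokens/s"
echo "full 32768-token window, cold: ~$(prefill_seconds 32768 "$rate") s"
```

That works out to roughly 417 tokens per second, or about 79 seconds to fill the whole 32K window from a cold cache.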
6. Confirm nothing left the machine. The server printed "Ollama cloud...