Running a coding agent locally with MLX for when you hit your token cap

Running a coding agent locally for when the budget runs out | Bart de Goede

In June 2026, Uber told its engineers they would each get a $1,500-a-month cap for AI coding tools and not a token more. That cap came after the company burned through its entire 2026 AI coding budget in four months: Claude Code adoption went from a third of engineers to 84% in about a quarter, power users were running $500–$2,000 a month, and the CTO ended up “back to the drawing board”. Uber wasn’t alone: Microsoft pulled Claude Code licenses from a division, Walmart capped its internal vibe-coding tool, and TechCrunch wrote about how “the token bill comes due”. The all-you-can-eat era is over, or nearing its end at least.

Which raises a very practical question for any team: what do you do on the 22nd of the month, when the budget’s gone and there’s over a week of work left? “Everyone stops shipping” is a bad answer, and going back to artisanal coding is unrealistic.

I’ve been messing around with LM Studio, Ollama and others, so when I saw a YouTube video from WWDC about running a coding agent locally with MLX directly, I wanted to see if that would be better. Not because a 30B model on my laptop is going to replace Claude (lol), but because a local agent that’s good enough for the routine stuff is a decent insurance policy at least, and may even handle some of the easy stuff. And it runs on hardware I already have (MacBook Pro with M4 Pro, 48 GB RAM), for the very agreeable price of free [1].

The three pieces

A local coding agent is three things talking HTTP:

Pi (the agent) ──OpenAI API──> a model server ──> Qwen3-Coder-30B on MLX

The model: Qwen3-Coder-30B-A3B, 4-bit. This is a “mixture-of-experts” model, which means it has 30B parameters in total but only ±3B are activated per token, so the promise is that you get 30B-ish quality at ±3B speed. It is tuned for agentic coding (multi-step tool use, editing files), which is what an agent actually does. And all that for ±17 GB of weights.
The agent: Pi, a minimal coding agent (it can read, write and edit files, and run bash, and not much else). It speaks plain OpenAI chat completions, which (foreshadowing) turns out to matter a lot.
The server: this is the rabbit hole part.

The agent was the easy choice. Pi is small, it’s provider-agnostic, and it doesn’t do anything clever with “billing headers” that break the cache for Ollama or LM Studio. Pick Pi, point it at a server, done.

First pass: the MLX server

Apple’s own MLX inference server (mlx_lm.server) is the most minimal, “official” option. It speaks the OpenAI API, it’s one pip install, it’s maintained by the MLX team. How hard could it be? Right? Reader, there were Some Things™ that needed fixing.

1. You can’t quantize the KV cache.
I wanted a big context window (coding agents constantly re-read files), but at 256K tokens an fp16 KV cache is ±24 GB, which on top of ±17 GB of weights means the machine starts swapping and everything grinds to a halt. The fix is an 8-bit KV cache, and mlx_lm.server has no flag for it. The library it’s built on supports KV quantization perfectly well; the server just doesn’t expose it (yet) [2]. So: monkeypatch it is.

2. The patch silently did nothing.
I patched the quantization in, tests passed, and the server kept using fp16 anyway. Turns out mlx_lm.server has two code paths: a single-request path (the one I initially patched) and a batched one built on a different class, which has no KV quantization and is the default for batchable models. My carefully quantized cache was running at full fp16 the whole time, silently. I only caught it because I’d added logging that printed peak memory. [3] Fix: force every request down the single-stream path. (Batching only helps with concurrent requests; I am not letting others run their prompts on my GPU, and my monkeybrain has a hard time with multiple coding agents running simultaneously anyway.)

3. It crashed on the first real edit.
This is the one that would’ve shipped if I’d trusted my smoke test. Checking tool calling with a simple prompt, get_weather(city="Paris"), passed perfectly. Then I ran a better eval (more on that below) and every task crashed instantly: “stream ended without finish_reason”. The server was crashing mid-request, every time the agent tried to edit a file.

When Pi edits code, the tool-call arguments contain the old and new code, i.e. strings with literal newlines in them. mlx-lm’s Qwen3-Coder tool parser tries json.loads (invalid, because JSON wants newlines escaped), falls back to ast.literal_eval (also invalid, because Python doesn’t like unterminated string literals) [4], and when both fail, the exception takes down the request handler. My smoke test never caught it because city="Paris" is one line....
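
The double failure is easy to reproduce outside the server. What follows is a minimal sketch, assuming the argument reaches the parser as a quoted string with a real newline inside it. `RAW_EDIT_ARG`, `failure_modes` and `parse_arg` are hypothetical names made up for illustration; this is not mlx-lm's actual parser code, only the same two standard-library calls with and without a safety net.

```python
import ast
import json

# Hypothetical tool-call argument values (assumed shape, not captured output).
RAW_SMOKE_ARG = '"Paris"'                        # one line, like the smoke test
RAW_EDIT_ARG = '"def greet():\n    return 1"'    # real newline inside the quotes


def failure_modes(raw: str) -> list[str]:
    """Return the exception names raised by json.loads and ast.literal_eval."""
    errors = []
    for parser in (json.loads, ast.literal_eval):
        try:
            parser(raw)
        except Exception as exc:
            errors.append(type(exc).__name__)
    return errors


def parse_arg(raw: str):
    """Try JSON, then a Python literal, then fall back to the raw text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Neither parser accepts the value: keep the text instead of raising.
        return raw


if __name__ == "__main__":
    print(failure_modes(RAW_SMOKE_ARG))  # both parsers accept the one-liner
    print(failure_modes(RAW_EDIT_ARG))   # both parsers reject the multi-line value
    print(repr(parse_arg(RAW_EDIT_ARG)))
```

The design choice in `parse_arg` is simply that a value nobody can parse is still a string, so it gets passed through unchanged (quotes included) rather than being allowed to take the request down with it. Whether that is the right fallback for a real server is a separate question.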
