Running a coding agent locally with MLX for when you hit your token cap


Running a coding agent locally for when the budget runs out | Bart de Goede

In June 2026, Uber told its engineers they would each get a $1,500-a-month cap for AI coding tools and not a token more. That cap came after the company burned through its entire 2026 AI coding budget in four months: Claude Code adoption went from a third of engineers to 84% in about a quarter, power users were running $500–$2,000 a month, and the CTO ended up “back to the drawing board”. Uber wasn’t alone; Microsoft pulled Claude Code licenses from a division, Walmart capped its internal vibe-coding tool, and TechCrunch wrote about how “the token bill comes due”.

The all-you-can-eat era is over, or nearing its end at least. Which raises a very practical question for any team: what do you do on the 22nd of the month, when the budget’s gone and there’s over a week of work left? “Everyone stops shipping” is a bad answer, and going back to artisanal coding is unrealistic.

I’ve been messing around with LM Studio, Ollama and others, so when I saw a YouTube video from WWDC about running a coding agent locally with MLX directly, I wanted to see if that would be better. Not because a 30B model on my laptop is going to replace Claude (lol), but because a local agent that’s good enough for the routine stuff is a decent insurance policy at the very least, and may even take on some of the easy stuff. And it runs on hardware I already have (MacBook Pro with M4 Pro, 48 GB RAM), for the very agreeable price of free.[1]

The three pieces

A local coding agent is three things talking HTTP:

Pi (the agent) ──OpenAI API──> a model server ──> Qwen3-Coder-30B on MLX

The model: Qwen3-Coder-30B-A3B, 4-bit. This is a “mixture-of-experts” model, which means it has 30B parameters in total but only about 3B are active for any given token, so the promise is that you get 30B-ish quality at roughly 3B speed.
It is tuned for agentic coding (multi-step tool use, editing files), which is what an agent actually does. And all that for about 17 GB of weights.

The agent: Pi, a minimal coding agent (it can read, write and edit files, run bash, and not much else). It speaks plain OpenAI chat completions, which (foreshadowing) turns out to matter a lot.

The server: this is the rabbit hole part.

The agent was the easy choice. Pi is small, it’s provider-agnostic, and it doesn’t do anything clever with “billing headers” that break caching for Ollama or LM Studio. Pick Pi, point it at a server, done.

First pass: the MLX server

Apple’s own MLX inference server (mlx_lm.server) is the most minimal, “official” thing. It speaks the OpenAI API, it’s one pip install, it’s maintained by the MLX team. How hard could it be? Right?

Reader, there were Some Things™ that needed fixing.

1. You can’t quantize the KV cache. I wanted a big context window (coding agents constantly re-read files), but at 256K an fp16 KV cache is about 24 GB, which on top of about 17 GB of weights means the machine starts swapping and everything grinds to a halt. The fix is an 8-bit KV cache. mlx_lm.server has no flag for it. The library it’s built on supports KV quantization perfectly well; the server just doesn’t expose it (yet).[2] So: monkeypatch it is.

2. The patch silently did nothing. I patched the quantization in, tests passed, and the server kept using fp16 anyway. Turns out mlx_lm.server has two code paths: a single-request path (the one I initially patched) and a batched one built on a different class, which has no KV quantization and is the default for batchable models. My carefully quantized cache was running at full fp16 the whole time, silently. I only caught it because I’d added logging that printed peak memory.[3] Fix: force every request down the single-stream path.
(Batching only helps with concurrent requests; I am not letting others run their prompts on my GPU, and my monkeybrain has a hard time with multiple coding agents running simultaneously anyway.)

3. It crashed on the first real edit. This is the one that would’ve shipped if I’d trusted my smoke test. Checking tool calling with a simple prompt, get_weather(city="Paris"), passed perfectly. Then I ran a better eval (more on that below) and every task crashed instantly: “stream ended without finish_reason”.

The server was crashing mid-request, every time the agent tried to edit a file. When Pi edits code, the tool-call arguments contain the old and new code, i.e. strings with literal newlines in them. mlx-lm’s Qwen3-Coder tool parser tries json.loads (invalid because JSON wants escaped newlines), falls back to ast.literal_eval (also invalid, Python doesn’t like unterminated string literals)[4], and when both fail, the exception takes down the request handler. My smoke test never caught it because city="Paris" is one line....
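
The failure in item 3 is easy to reproduce with nothing but the standard library. The sketch below is not mlx-lm's parser code and not the fix used in this article; `parse_tool_args` and both payloads are hypothetical, made up only to show why a one-line argument passes while a multi-line edit defeats both `json.loads` and `ast.literal_eval`.

```python
import ast
import json


def parse_tool_args(raw: str):
    """Hypothetical helper: try strict JSON, then a Python literal.

    Returns None when both parsers reject the input instead of letting
    the exception propagate to the caller.
    """
    for parser in (json.loads, ast.literal_eval):
        try:
            return parser(raw)
        except (ValueError, SyntaxError):
            # json.JSONDecodeError is a ValueError; an unterminated string
            # literal makes ast.literal_eval raise SyntaxError.
            continue
    return None


# Smoke-test style payload: everything on one line.
smoke = '{"city": "Paris"}'

# Edit-style payload: the "\n" below becomes a real newline character
# inside the quoted value, which is what a code edit looks like.
edit = '{"old_text": "def f():\n    return 1"}'

print(parse_tool_args(smoke))          # {'city': 'Paris'}
print(parse_tool_args(edit))           # None: both parsers refused it
print(json.loads(edit, strict=False))  # {'old_text': 'def f():\n    return 1'}
```

The last line uses the `strict=False` option of Python's `json` module, which allows control characters such as raw newlines inside strings. It is shown only as one possible way to tolerate this kind of payload, under the assumption that the rest of the arguments are otherwise valid JSON.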
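
Going back to the KV cache numbers in item 1: the 24 GB figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes architecture values for Qwen3-Coder-30B-A3B that are not stated in this article (48 layers, 4 key/value heads, head dimension 128), so treat them as assumptions to verify against the model's config. The 8-bit estimate also ignores the per-group scale and bias that a real quantized cache stores on top.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Size of a KV cache: keys and values are both stored, hence the 2."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed model shape (not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128
CONTEXT = 256 * 1024  # a 256K-token context window

fp16 = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
int8 = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)

print(fp16 / 2**30)  # 24.0 (GiB at fp16)
print(int8 / 2**30)  # 12.0 (GiB at 8 bits, before quantization overhead)
```

Under these assumptions, halving the bytes per cached value halves the cache, which is the difference between weights plus cache fitting in 48 GB of RAM with room to spare and the machine starting to swap.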
