Local AI Is Not Ready for Coding. Yet?

speckx1 pts0 comments

Local AI Is Not Ready for Coding. Yet? - Methodos Mechanicus

Skip to content

Blog<br>Highlights<br>Projects<br>About Me

In the GasTown post I made a throwaway claim: your worker agents — the polecats that actually write the code — don’t need your strongest model. Run them on Sonnet, not Opus, and your API bill stops looking like a car payment.

That got me thinking about the obvious next step. If a polecat is just a scoped worker that picks up a task and implements it, why pay for an API at all? Why not run it on a model on my own machine ? No per-token bill, no rate limits, no data leaving the building. For a fleet of workers that spin up dozens of times a day, “free and private” is a very loud pitch.

So I tested it. Properly — not “ask a local model to write fizzbuzz,” but drop a local model into our real production harness and tell it to go do a job, exactly like the cloud agents do.

The short version: the model that fits on your laptop can’t carry the job, and the model that can carry the job doesn’t fit on your laptop. The longer version is more interesting, because the local models were both more and less capable than I expected — sometimes in the same sentence.

The Setup#

I wanted a fair fight, so I gave the local models the exact same harness the cloud agents use.

Machine: a maxed-out MacBook Pro — M5 Max, 48-core GPU, 128 GB unified memory . This is as much model as consumer hardware will hold.

Serving: Ollama on the host, exposed over its OpenAI-compatible API, driven by the opencode harness so the agent gets real tools: bash, read, write, edit, grep.

The two contestants:

Devstral 24B — a dedicated local coding model (~25 GB at Q8). Small, fast, purpose-built.

Qwen3.5 122B-a10b — a 122-billion-parameter mixture-of-experts (~10B active), ~81 GB in memory at Q4. The most capable model I can physically load on this machine. When people say “run a frontier model locally,” this is roughly the ceiling of what that means on a laptop.

The harness: GasCity, the production successor to the GasTown setup I wrote about earlier. Same mental model — a Mayor that plans, polecats that implement, a Reviewer, a Refinery that merges. Work flows as beads (tasks) bundled into convoys, and a polecat runs a formula (a multi-step workflow) to take a task from “assigned” to “merged.”

The task itself was deliberately trivial: “Create a file with one line of specific text.” If a model can’t do that, nothing else matters. If it can, we learn where the wheels come off.

The control group

Before judging the local models, I ran the identical task through a Claude Sonnet polecat. It sailed through the whole pipeline — created the file, committed it, handed off to the Reviewer, got merged to main, closed the bead. No drama. That matters: it proves the harness, the dispatch, and the task are all sound. Everything that follows is the model’s contribution, not a broken setup.

Devstral (24B): Eager, Capable, and Can’t Spell#

Handed our full agent onboarding prompt — about 25 KB of role, rules, and protocol — Devstral did something I didn’t expect. It introduced itself.

“I’m here to assist you with software engineering tasks. To get started, please let me know what you need help with.”

Zero tool calls. It read 15,000 tokens of “you are an autonomous worker, here is how to find and execute your assigned task,” and concluded that the correct move was to wait politely for instructions. A frontier model treats that same prompt as a starting gun; Devstral treated it as a company handbook.

But here’s the twist that makes “local AI isn’t ready” too glib: when I stripped the ceremony and gave it a direct, explicit instruction — “create this file, use your write tool, then run ls” — it just did it. Tool call, file written, verified. About three times out of four. The capability is real; it only shows up when the ask is concrete.

And then, on the trivial task, it did this:

The model that renamed the deliverable

I asked for a file named MINSTRAL.md . Devstral created minstrel.md — silently “corrected” my spelling and lowercased it — then explained, with total confidence, that this was fine “since filenames are case-sensitive on Linux.”<br>That justification is not just wrong, it’s backwards. Case-sensitivity is exactly why MINSTRAL.md and minstrel.md are two different files. The model produced a plausible-sounding sentence to rationalize an error it didn’t notice it was making. This is the local-model failure mode in miniature: confident, fluent, and subtly off — on a task with one requirement.

For glue code and scoped edits where you’re checking the output anyway, a 24B coding model is genuinely useful. For anything you’re not going to read line-by-line, that confidence-without-correctness is a tax you pay later.

Qwen3.5 (122B): Genuinely Agentic — and Genuinely Lost#

The 122B model is a different animal, and this is the result that surprised me most.

Handed the same full onboarding prompt that made Devstral freeze, Qwen...

model local task devstral harness file

Related Articles