What Breaks When You Skip the Harness

Ian Johnson · Jun, 2026 · 8 min read

The agent kept running the tests, watching them go red, scrolling up to find the failure, and then running the tests again because it had already lost the output. Third time in one session. The model wasn’t the problem. The session’s memory of its own output was. I had Claude tee the test command to a log file. After that it read the log instead of re-running. That was one fix on one project. The pattern repeats. If the model keeps producing bad code, what is actually broken?
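The tee fix can be sketched in a few lines. This is a minimal sketch, not the script I actually used: the function name `run_and_log` and the log path are assumptions for illustration. It runs a command, echoes the output, and keeps a copy on disk so the agent can read the log instead of re-running the suite.

```python
import subprocess
import sys
from pathlib import Path


def run_and_log(command: list[str], log_path: Path) -> int:
    """Run a command, echo its output, and save a copy to log_path.

    Hypothetical helper: the name and the log location are illustrative.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the same stream
        text=True,
    )
    log_path.write_text(result.stdout, encoding="utf-8")  # the copy the agent reads later
    sys.stdout.write(result.stdout)  # still show the output in the session
    return result.returncode
```

A call such as `run_and_log([sys.executable, "-m", "pytest", "-q"], Path("test-output.log"))` would be one way to wire it up, assuming the project uses pytest. The matching rule for the agent is then "read test-output.log before re-running the tests."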

The model isn’t the problem

Output quality is the sum of two things: the model and the surroundings. In the teams I’ve worked with, the obsession is always the model. They upgrade. Switch tools. New IDE plugin every month. They read every benchmark. The defects keep landing in the same places. The surroundings get almost no attention.

Those surroundings are the harness: the files, rules, tools, and feedback loops that wrap the model inside a real project. A CLAUDE.md with conventions. Skills: codified workflows the agent can run by name. MCP servers. Scoping rules. A feedback.md that records what the team has corrected. Pre-commit gates. Without those, the model is guessing.

What follows: five things teams blame on the model, each pointing at a missing piece of the harness, each with a first move you can ship this week. Some of the bigger fixes (codifying a team’s lore, writing a useful project-specific MCP, etc.) take a quarter of work to do properly. The first move is week-sized. The first move gets you started.

APIs that don’t exist

The model writes a call to client.users.list({ since: lastSyncAt }). The argument doesn’t exist. The real API takes a Unix timestamp on a different endpoint. Code compiles. Test fails. The engineer reads the docs. The bug gets filed as “the model hallucinated.”

Except the model didn’t hallucinate from nothing. It wrote what it learned from training data and didn’t verify. Nothing in the session told it to.

The fix is grounding. Give the model a way to read the real docs in the moment. Two grounding moves do the work.

First: an MCP server for library docs. Context7 is one. The model fetches current docs and writes against what’s there. When it fails, it’s almost always because the rule telling the agent to fetch first wasn’t in CLAUDE.md.

Second: a project-specific MCP for internal APIs. Every team I’ve worked with has one or two services nobody got around to documenting well. An MCP that serves the OpenAPI spec, or a search over the service’s source, puts real signatures in front of the model.

A single rule in CLAUDE.md ties it together. “Fetch docs before writing library code.” Six words. The agent reads them at the start of every session.

Smallest first move: add Context7. Add the rule. Try it on the next library task.
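What “real signatures in front of the model” buys you can be sketched as a check against a spec. This is a minimal sketch under loud assumptions: the OpenAPI fragment, its paths, and the parameter names `limit` and `after_ts` are invented for illustration, and `unknown_params` is a hypothetical helper, not part of any MCP server.

```python
# Hypothetical OpenAPI fragment. The paths and parameter names are invented
# for illustration; a real project would load its own spec file instead.
SPEC = {
    "paths": {
        "/users": {"get": {"parameters": [{"name": "limit", "in": "query"}]}},
        "/users/changes": {"get": {"parameters": [{"name": "after_ts", "in": "query"}]}},
    }
}


def unknown_params(spec: dict, path: str, method: str, used: set[str]) -> set[str]:
    """Return the argument names in `used` that the spec does not declare.

    An unknown path or method declares nothing, so every name comes back.
    """
    operation = spec["paths"].get(path, {}).get(method, {})
    declared = {param["name"] for param in operation.get("parameters", [])}
    return used - declared
```

Against this made-up spec, `unknown_params(SPEC, "/users", "get", {"since"})` flags `since` before the test suite has to. The point is only that the spec, not the model’s memory, decides which arguments exist.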

Code that doesn’t match yours

The model writes a new service. The file lands in the wrong directory. Errors throw instead of returning. The naming is camelCase in a snake_case codebase. The diff looks foreign. The reviewer sees it and rewrites it. The next PR has the same problems.

The model has no reference for “how we do it here.” The rules live in three engineers’ heads and one out-of-date README.

Part one is a CLAUDE.md that names the project’s conventions in plain words. Where files go. How errors propagate. Naming style. Testing framework. Logger. Time library. One or two sentences each. The bare minimum.

Part two matters more: worked examples. When we moved a service to hexagonal architecture, the agent kept writing new code on the wrong side of the ports-and-adapters boundary. The old code was still everywhere and gave it permission to keep doing what it had always done.

The model copies texture from examples better than from prose. A skill or a feature doc with one full worked example, such as a real PR, a real file, or a real test, gives it something to imitate. Three examples beat a thousand words of rules.

The catch: if the codebase itself is inconsistent (legacy module is snake_case, new code is camelCase), the model picks up the contradiction and writes both. The worked example has to come from the side of the codebase you want the agent to copy.

Smallest first move: pick the workflow your team runs most often. Write one skill for it. Drop one full example into the file.
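Here is the kind of worked example a skill file might carry. It is a hypothetical sketch, not code from a real project: it assumes a codebase where errors are returned as values and names are snake_case, and the `Result` type and `load_user` function are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """Hypothetical return type: carries either a value or an error message."""

    value: Optional[dict] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def load_user(users: dict[str, dict], user_id: str) -> Result:
    """Look up a user; report a miss as a returned error, not an exception."""
    user = users.get(user_id)
    if user is None:
        return Result(error=f"user {user_id} not found")
    return Result(value=user)
```

One file like this shows the agent three conventions at once: naming, error propagation, and where the docstring goes. Prose rules would need a sentence for each.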

The same bug, in every session

You correct the agent on Monday. It used the wrong logger. You tell it to use the project’s logger. It does. Friday, new task, wrong logger again.

The agent didn’t learn. The session ended. The correction went with it.

The fix is a feedback loop the agent can read. Not a chat log, but a file.
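A feedback file can start as small as this. It is a minimal sketch: the one-bullet-per-correction format and the helper names `record_correction` and `load_corrections` are assumptions for illustration, not a format any tool requires.

```python
from datetime import date
from pathlib import Path


def record_correction(feedback_file: Path, rule: str, when: date) -> None:
    """Append one dated correction as a bullet line; create the file if needed."""
    with feedback_file.open("a", encoding="utf-8") as handle:
        handle.write(f"- {when.isoformat()}: {rule}\n")


def load_corrections(feedback_file: Path) -> list[str]:
    """Return every recorded correction, oldest first, without the bullet."""
    if not feedback_file.exists():
        return []
    lines = feedback_file.read_text(encoding="utf-8").splitlines()
    return [line[2:].strip() for line in lines if line.startswith("- ")]
```

Monday’s correction becomes one call to `record_correction(Path("feedback.md"), "Use the project's logger", date.today())`. Friday’s session reads the file before it writes a line.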
