Building the harness around our coding agents: eight failure modes, eight pillars


Teams building with AI usually end up building two products: the thing they ship, and the system around their agents that makes those agents useful for building it.

We built such a system to help us ship Nimbalyst. We call it our team harness. This post is about what we learned from doing it.

What a harness is

A harness is the durable layer around a model: instructions, tools, permissions, context, and verification.

Claude Code and Codex are harnesses in this sense. Each wraps a model with a system prompt, a tool surface, a permission model, and an execution loop. Anthropic and OpenAI own that layer.
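
To make the four parts concrete, here is a minimal sketch of a harness loop. It is an illustration only, not how Claude Code or Codex are implemented: `Harness`, `ToolCall`, and the scripted stand-in model are hypothetical names, and the "model" is just a function that returns the next tool call or `None` when it is done.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class ToolCall:
    """Hypothetical shape of a tool request coming from the model."""
    tool: str
    argument: str


@dataclass
class Harness:
    system_prompt: str                       # instructions
    tools: Dict[str, Callable[[str], str]]   # tool surface
    allowed_tools: Set[str]                  # permission model
    transcript: List[str] = field(default_factory=list)

    def run(self, model: Callable[[List[str]], Optional[ToolCall]],
            max_steps: int = 10) -> List[str]:
        """Execution loop: ask the model, check permissions, run the tool, record the result."""
        self.transcript = [self.system_prompt]
        for _ in range(max_steps):
            call = model(self.transcript)
            if call is None:  # the model has nothing more to do
                break
            if call.tool not in self.tools:
                self.transcript.append(f"error: unknown tool {call.tool}")
            elif call.tool not in self.allowed_tools:
                self.transcript.append(f"denied: {call.tool}")
            else:
                self.transcript.append(self.tools[call.tool](call.argument))
        return self.transcript


def scripted_model(transcript: List[str]) -> Optional[ToolCall]:
    """Stand-in for a real model: read a file, try to delete it, then stop."""
    script = [ToolCall("read", "README.md"), ToolCall("delete", "README.md"), None]
    return script[len(transcript) - 1]


harness = Harness(
    system_prompt="You are a coding agent.",
    tools={"read": lambda p: f"contents of {p}", "delete": lambda p: f"deleted {p}"},
    allowed_tools={"read"},  # delete exists as a tool but is not permitted
)
result = harness.run(scripted_model)
```

In this sketch the delete request is recorded as denied rather than executed. The permission check sits in the loop, outside the model.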

Your team owns the next layer up: the workspace where agents do product work alongside you, with your files, tasks, diagrams, diffs, and decisions. This layer carries the knowledge your team has accumulated: how you build things, what you already decided, what is connected to what, where the agent is allowed to act, and how it checks its own work.

The line between context and harness can blur. A ticket or spec is task-specific context, but the mechanism that makes that ticket searchable, linkable, versioned, and retrievable by any agent is part of the harness.

Almost nothing in a good harness is novel. It is mostly other people’s parts assembled around your project: Claude Code, Codex, MCP, Playwright, a tracker, a diagramming tool, an editor, a test runner, your repository, your docs. The harness is the way those pieces are put together so an agent can pull the right context for a task and verify what it produced.

Eight failure modes, eight pillars

We arrived at eight pillars for our harness, each one addressing a failure mode of the coding agent.

| Failure mode without the harness | Pillar that answers it |
| --- | --- |
| Doesn’t know your codebase, rules, decisions, or conventions | Context |
| Can’t traverse the links between artifacts that already exist | Provenance |
| Can’t act on the world or observe what it did | Capability |
| Reinvents how to do every task | Workflow |
| Does something dangerous because nothing stops it | Restraint |
| Hallucinates “fixed” without proof | Verification |
| Can’t show results back to humans in a useful form | Visual interface |
| The human can’t keep track of work happening across many agents in parallel | Coordination |

The rest of this post walks through each pillar and what we built for it.

1. Context

Goal: know the project.

Failure mode this answers: the agent doesn’t know your codebase, rules, decisions, or conventions, so it solves every problem like it has never seen this project before.

Context is everything specific to our project: code, specs, design docs, tracker items, data models, past decisions, conventions, examples, and recipes.

In our harness that means:

Code, specs, plans, and mockups live as local files in formats an agent can read and edit directly.

Architecture diagrams live as Excalidraw files instead of screenshots trapped in a slide deck.

Decisions are captured as tracker items, not buried in chat transcripts.

Bug histories are searchable, so the agent can see symptoms, root cause, and previous fixes.

Root instruction files like CLAUDE.md and AGENTS.md load at session start and point the agent at the rest.

Path-scoped rule files load only when the agent touches a relevant directory, so React rules show up for renderer code and Swift rules show up for the iOS package.

A skill system holds reusable instructions for recurring jobs: how we write tests, add analytics events, release a package, or debug a failing screen.

Persistent per-user memory captures preferences and validated approaches across sessions.

An agent editing renderer code loads React rules without loading iOS rules. An agent fixing a regression finds the prior bug, the root cause, and the fix before writing code. Each session starts with the team’s accumulated decisions already in scope instead of being re-derived from the prompt.
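
Path-scoped rule loading can be sketched in a few lines. This is a minimal illustration, not our actual loader: the directory layout (`packages/renderer`, `packages/ios`) and the rule file names under `rules/` are assumptions made up for the example. Only `CLAUDE.md` and `AGENTS.md` come from the list above.

```python
from fnmatch import fnmatchcase

# Root instruction files always load at session start.
ROOT_RULES = ["CLAUDE.md", "AGENTS.md"]

# Hypothetical glob patterns mapped to hypothetical rule files.
SCOPED_RULES = {
    "packages/renderer/*": "rules/react.md",
    "packages/ios/*": "rules/swift.md",
}


def rules_for(touched_paths):
    """Return the rule files to load for the paths an agent is touching."""
    selected = list(ROOT_RULES)
    for pattern, rule_file in SCOPED_RULES.items():
        # fnmatchcase's "*" also matches "/" so the pattern covers nested paths.
        if any(fnmatchcase(path, pattern) for path in touched_paths):
            selected.append(rule_file)
    return selected


renderer_rules = rules_for(["packages/renderer/src/App.tsx"])
ios_rules = rules_for(["packages/ios/Sources/View.swift"])
```

Here `renderer_rules` contains the React rule file but not the Swift one, and `ios_rules` the reverse. Both start with the root instruction files.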

2. Provenance

Goal: trace the why.

Failure mode this answers: the agent can’t traverse the links between artifacts that already exist, so the reasoning behind every change has to be re-explained or rediscovered.

Provenance is how code changes stay linked to the intent that produced them. It is a persistent, typed record of why each change exists, navigable from any direction: from the file, from the session, from the tracker item, from the commit. The underlying data structure is a typed graph of links between artifacts; the value is being able to ask “why is this the way it is?” and get an answer.
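
A typed link graph of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not our storage format: `Artifact`, `LinkGraph`, the link type names, and the identifiers are all hypothetical. The `why` method does a breadth-first search from any artifact to the nearest tracker item and returns the chain of links it followed.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    kind: str   # e.g. "file", "commit", "session", "tracker_item"
    ident: str


class LinkGraph:
    def __init__(self):
        # Artifact -> list of (link_type, Artifact)
        self._edges = defaultdict(list)

    def link(self, source, link_type, target):
        """Record a typed link, stored in both directions so it is navigable from either end."""
        self._edges[source].append((link_type, target))
        self._edges[target].append((link_type, source))

    def why(self, start, wanted_kind="tracker_item"):
        """Breadth-first search from start to the nearest artifact of wanted_kind.

        Returns the list of (link_type, Artifact) hops, or None if nothing is reachable.
        """
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, path = queue.popleft()
            if node.kind == wanted_kind:
                return path
            for link_type, neighbor in self._edges.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, path + [(link_type, neighbor)]))
        return None


# Hypothetical artifacts and link types.
source_file = Artifact("file", "src/editor.ts")
commit = Artifact("commit", "abc123")
session = Artifact("session", "session-1")
item = Artifact("tracker_item", "BUG-1")

graph = LinkGraph()
graph.link(commit, "modifies", source_file)
graph.link(session, "produced", commit)
graph.link(item, "addressed_in", session)

chain = graph.why(source_file)
```

Starting from the file, `chain` walks commit, then session, then tracker item. The same graph answers the question from the other direction, for example from the tracker item to the files it touched, by passing a different start and `wanted_kind`.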

In our harness that means:

A typed link graph between tracker items, plans, specs, diagrams, mockups, sessions, diffs, files, commits, and decisions.

First-class editors for those artifacts inside...
