The 98% Problem: A Survey of Harness Engineering for AI Agents


BeConfident Labs<br>[Figure: GROOM. Panel (a), trigger sequence, with lanes Agent session, Launcher, GROOM agent, and wiki/ along a time axis t; labels: checks (enabled? last run ≥ 24 h ago?), skill activation, spawn · detached, read all pages, page contents, edits · one op, append journal entry, conversation proceeds and is never blocked. Panel (b), maintenance ops: lint (Fix the form. Never touch the meaning.), prune (Cut what repeats. Leave fewer lines behind.), expand (Ingest what changed. Touch at most 3–6 files.), research (Admit new papers. Demand citations first.). One op per cycle, chosen by config.]

AUTHORS<br>Gui Dávid<br>Author contributions statement below.

AFFILIATIONS<br>Head of AI at BeConfident

PUBLISHED<br>June 12, 2026

Below the Model<br>Frontier models converged between 2023 and 2026. For most production tasks, swapping one top model family for another no longer changes the outcome much. The systems that win differ a layer down: in how they call the model, what context they feed it, which tools they expose, where execution happens, and how they measure outcomes. Practitioners call that layer the harness.<br>98.4%<br>of Claude Code’s codebase is harness, not model: the context, tools, permissions, sandboxing, and recovery wrapped around the ~1.6% that is actual AI decision logic[17]. This is the 98% problem.

93%<br>of permission prompts get approved, which makes routine prompts unreliable as a safety control[17].

54<br>built-in tools at most in Claude Code’s curated tool surface, 19 of them always on[17].

The first number gives this paper its title. A community dissection of Claude Code, the most documented production agent, estimates that about 1.6% of the code decides what the model does; the rest assembles context, dispatches tools, checks permissions, sandboxes execution, persists state, and recovers from failure[17]. Lines of code measure where the engineering lives, not where the capability lives, so treat the ratio as an order-of-magnitude claim. It still names a real condition: the 98% problem. The layer that decides agent quality is the one nobody benchmarks, few teams staff, and every project rebuilds from scratch.<br>Definition. Harness engineering is the design and operation of the control, execution, safety, evaluation, and training infrastructure that turns one or more models into a dependable agentic system. Prompts are one input to one call; the harness governs every call across a task.

The discipline now has primary literature. Anthropic published four engineering guides on agent design[3], context engineering[4], tool design[5], and long-running harnesses[6]. OpenAI wrote up Codex harness practice[2]. Böckeler published an early practitioner framework[1], and two academic teams dissected production systems end to end[17],[18]. Survey here means a synthesis of that primary engineering literature, not a systematic review.<br>One mental model organizes the field. Treat the harness as an operating system and the model as a process inside it. The OS decides what memory the process reads, which syscalls exist, which calls succeed, where execution happens, and what the process learns about the world. The pattern compresses to a rule: the model proposes, the harness disposes[3],[17]. A design that lets the model grant its own permissions has handed it root.<br>One disclosure up front: parts of this survey were researched and maintained by GROOM, the system its final sections describe.<br>Anatomy of a Harness<br>We decompose the production agents with public end-to-end dissections[17],[18] into the same eight subsystems around the model. Figure 1 maps them. In the interactive original, each subsystem opens a panel with its job, the pattern that makes it work, and the failure you get when you skip it.<br>[Figure 1, center: MODEL, the interchangeable component.]

[Figure 1, subsystems: Orchestrator / Loop · Context Engine · Tools & MCP · Permissions · Sandbox · Memory · Sub-agents · Observability & Evals.]

Context Engine<br>Curates the smallest set of high-signal tokens for each model call. Windows grew to 1M tokens, but context rot did not vanish: recall still degrades as attention is spread over a number of token pairs that grows quadratically with context length. Production systems run layered compaction: cheap trims first, full LLM summarization only under pressure, with the full history preserved as an append-only record.<br>Key pattern: Append-only state, projection at read time. Compaction is a view, not a write.<br>Canonical failure: Compaction-as-truncation. Chopping history destroys architectural decisions and unresolved bugs.
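The key pattern in the Context Engine panel can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any production system's implementation: the names `Event`, `Transcript`, and `project` are hypothetical, the budget is counted in characters as a stand-in for tokens, and the final tier uses a placeholder string where a real harness would call an LLM summarizer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    role: str  # e.g. "user", "assistant", "tool"
    text: str


@dataclass
class Transcript:
    """Append-only record: events are added, never edited or dropped."""

    events: list[Event] = field(default_factory=list)

    def append(self, role: str, text: str) -> None:
        self.events.append(Event(role, text))

    def project(self, budget_chars: int, keep_recent: int = 2) -> list[str]:
        """Build a read-time view for one model call; the log is untouched."""
        lines = [f"{e.role}: {e.text}" for e in self.events]
        if sum(len(s) for s in lines) <= budget_chars:
            return lines  # no pressure, so no compaction at all

        split = max(len(lines) - keep_recent, 0)
        older, recent = lines[:split], lines[split:]

        # Tier 1, cheap trim: clip each older line to a short prefix.
        trimmed = [s if len(s) <= 40 else s[:40] + "..." for s in older]
        view = trimmed + recent
        if sum(len(s) for s in view) <= budget_chars:
            return view

        # Tier 2, under pressure: a real harness would summarize with an LLM.
        # This placeholder only records how many events were folded away.
        return [f"[summary of {len(older)} earlier events]"] + recent
```

Because `project` only reads `events`, a bad summary costs one model call rather than the history: the next call can project again from the full record with a different budget or strategy.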

Figure 1 · Anatomy of a harness (interactive). Eight subsystems around a small, swappable model.<br>The Agent Loop<br>The runtime shape descends from ReAct[7]: assemble context, call the model, gate the proposal, execute in isolation, observe, repeat until a stop condition. Figure 2 shows where the harness inserts itself.<br>[Figure 2, stages: Assemble (compaction · retrieval) → Model call (proposes an action) → Gate (risk score · approval) → Execute (inside a sandbox) → Observe (append to trace). Denied: the reason returns as text and the model re-plans. Loop until: done · budget exhausted · evaluator veto · abort.]

Figure 2 · The agent loop. Five lines of...
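The loop described above can be sketched as a short function. This is an illustrative sketch under assumptions, not the code of any system the survey cites: `Action`, `run_loop`, and the three callables are hypothetical names, the model is replaced by a scripted function, the "sandbox" is a stub, and only two of the four stop conditions (done and budget exhausted) are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    tool: str
    arg: str
    done: bool = False


def run_loop(
    propose: Callable[[list[str]], Action],    # stand-in for the model call
    gate: Callable[[Action], Optional[str]],   # denial reason, or None to allow
    execute: Callable[[Action], str],          # stand-in for sandboxed execution
    task: str,
    max_steps: int = 8,
) -> list[str]:
    trace = [f"task: {task}"]
    for _ in range(max_steps):                 # step budget
        context = trace[-20:]                  # assemble (placeholder for compaction)
        action = propose(context)              # the model proposes
        if action.done:
            trace.append("stop: done")
            return trace
        reason = gate(action)                  # the harness disposes
        if reason is not None:
            trace.append(f"denied: {reason}")  # reason goes back as text
            continue                           # the model re-plans next turn
        trace.append(f"observed: {execute(action)}")
    trace.append("stop: budget exhausted")
    return trace


# Hypothetical stand-ins used only to exercise the loop.
def scripted_model(context: list[str]) -> Action:
    denied = any(line.startswith("denied") for line in context)
    observed = any(line.startswith("observed") for line in context)
    if denied and observed:
        return Action("", "", done=True)
    if denied:
        return Action("read_file", "notes.txt")
    return Action("shell", "rm -rf /")


def deny_shell(action: Action) -> Optional[str]:
    return "shell is not allowed" if action.tool == "shell" else None


def fake_execute(action: Action) -> str:
    return f"read {action.arg}"
```

Calling `run_loop(scripted_model, deny_shell, fake_execute, "tidy")` walks the denial path: the first proposal is refused, the reason lands in the trace, and the scripted model switches to an allowed tool on the next turn. The gate runs in the harness, outside anything the model can rewrite, which is the point of the design.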
