Agent Harness Engineering: A Survey
Junjie Li (1,6,*), Xi Xiao (6,*), Yunbei Zhang (5,*), Chen Liu (2,*), Lin Zhao (4), Xiaoying Liao (3), Yingrui Ji (6), Janet Wang (5), Jianyang Gu (7), Yingqiang Ge (9), Weijie Xu (9), Xi Fang (9), Xiang Xu (9), Tianchen Zhao (9), Youngeun Kim (9), Tianyang Wang (6), Jihun Hamm (5), Smita Krishnaswamy (2), Jun Huan (9,†), Chandan K Reddy (8,9,†)
(1) Carnegie Mellon University · (2) Yale University · (3) Johns Hopkins University · (4) Northeastern University · (5) Tulane University · (6) University of Alabama at Birmingham · (7) The Ohio State University · (8) Virginia Tech · (9) Amazon
*Equal contribution. †Corresponding authors.
A side-by-side comparison of prompt, context, and harness engineering.
Abstract
The harness is becoming the binding constraint.
The rapid deployment of large language model agents in production has revealed a recurring pattern: task execution reliability depends less on the underlying model than on the infrastructure layer that wraps it, the agent execution harness.
This survey presents agent harness engineering as an independent system layer, proposes the seven-layer ETCLOVG taxonomy (Execution, Tooling, Context, Lifecycle, Observability, Verification, Governance), and maps a broad corpus of open-source projects onto that taxonomy to expose ecosystem patterns, coverage gaps, and emerging design principles.
Contributions
Three Claims
Claim 1: Harnesses are independent system layers.
Real-world reliability is shaped by execution controls, feedback loops, governance, evaluation, and operational design, not only by model capability.
Claim 2: ETCLOVG separates production concerns.
Execution, Tooling, Context, Lifecycle, Observability, Verification, and Governance expose architectural boundaries that earlier frameworks often conflate.
Claim 3: A broad ecosystem map reveals gaps.
A systematic mapping of the open-source ecosystem surfaces adoption patterns across sandboxes, protocols, memory systems, orchestrators, observability platforms, benchmarks, and governance stacks.
Three Engineering Phases
Read across 2022–2026, agent engineering shows a coherent shift in where the marginal effort lands. The three phases overlap in time and concept; they describe what the field has chosen to engineer, not a clean sequence of replacements.
2022–2024
Prompt engineering. The primary lever is the input prompt text: instructions, few-shot examples, and reasoning templates, all optimized for a single model call.
2025
Context engineering. The question shifts from “what is the input?” to “what should the model see at each step?” The scope expands to retrieval, compaction, tool-result ranking, and managing context-window saturation across turns.
2026–
Harness engineering. As models become capable enough to attempt long-running tasks, the engineering focus expands to the full infrastructure wrapper: execution environment, tool interface, context, lifecycle, observability, verification, and governance.
Timeline of Agent-Harness Systems
The same shift is visible in the systems themselves. The ReAct era of 2022–2023 wrapped a single model in a while-loop, a prompt template, and a small tool dispatch table; AutoGPT and BabyAGI exposed the resulting failures, including execution runaway, context blowout, state loss, and unmonitored side effects, as infrastructure problems rather than prompt problems. Tool integration and multi-agent coordination from 2023–2024 added learned tool use (Gorilla, ToolLLM, Toolformer), role-playing organizations (CAMEL, ChatDev, MetaGPT, Mixture-of-Agents), the first agent benchmarks (SWE-bench, AgentBench, WebArena, GAIA), and the beginnings of protocol standardization (MCP, A2A). By 2025–2026 enough deployment experience had accumulated that “harness engineering” began to be named as a discipline of its own, accompanied by automated harness optimization and a wave of results in which only the harness was varied.
Representative agent-harness systems by ETCLOVG layer, 2022–2026.
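To make the ReAct-era wrapper described above concrete, the following is a minimal sketch of a while-loop, a prompt template, and a small tool dispatch table. It is illustrative only and is not taken from any of the systems named here: the template wording, the `FINAL:` convention, the tool names, and `scripted_model` are all assumptions, and the model is a stand-in callable rather than a real API. The `max_steps` budget is an added guard against the execution-runaway failure mentioned above, not part of the original pattern.

```python
from typing import Callable, Dict, List

# Hypothetical prompt template; the wording is illustrative only.
PROMPT_TEMPLATE = "Task: {task}\nHistory:\n{history}\nNext action:"

# Small tool dispatch table: tool name -> plain Python callable (toy tools).
TOOLS: Dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}


def run_agent(task: str, model: Callable[[str], str], max_steps: int = 5) -> str:
    """Loop: render the prompt, call the model, dispatch a tool, repeat."""
    history: List[str] = []
    steps = 0
    while steps < max_steps:  # step budget guards against runaway loops
        prompt = PROMPT_TEMPLATE.format(task=task, history="\n".join(history))
        reply = model(prompt).strip()
        # Assumed convention: the model signals completion with "FINAL:".
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        # Otherwise treat the reply as "<tool name> <argument>".
        name, _, arg = reply.partition(" ")
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        history.append(f"{reply} -> {observation}")
        steps += 1
    return "step budget exhausted"


def scripted_model(prompt: str) -> str:
    """Stand-in model: call one tool, then finish once an observation exists."""
    if "->" in prompt:
        return "FINAL: done"
    return "upper hello"
```

Calling `run_agent("demo", scripted_model)` runs one tool step and then terminates. A model that never emits `FINAL:` stops only because of the step budget, which is the kind of control the harness, not the prompt, has to supply.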
The ETCLOVG Taxonomy
We organize the harness into seven layers. The first four describe the structural core of a harness; the last three describe the control plane around it. Compared with earlier six-component frameworks, Observability and Governance appear here as independent layers because, in production deployments, each has its own tooling stack and is owned by a different team.
The ETCLOVG taxonomy. E, T, C, and L form the structural pillars; O provides system-wide monitoring; V delivers evaluation and feedback; G enforces governance constraints across the system.
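The split between the structural core and the control plane can be written down as a small data structure. The sketch below is one possible encoding, not the survey's own tooling: the `Layer` enum, the two groupings, and the `coverage_gaps` helper are hypothetical names, and the project labels in the usage note are placeholders rather than real projects.

```python
from enum import Enum
from typing import Dict, List, Set


class Layer(Enum):
    """The seven ETCLOVG layers, in acronym order."""
    EXECUTION = "E"
    TOOLING = "T"
    CONTEXT = "C"
    LIFECYCLE = "L"
    OBSERVABILITY = "O"
    VERIFICATION = "V"
    GOVERNANCE = "G"


# First four layers: structural core. Last three: control plane.
STRUCTURAL_CORE = (Layer.EXECUTION, Layer.TOOLING, Layer.CONTEXT, Layer.LIFECYCLE)
CONTROL_PLANE = (Layer.OBSERVABILITY, Layer.VERIFICATION, Layer.GOVERNANCE)


def coverage_gaps(project_layers: Dict[str, Set[Layer]]) -> List[Layer]:
    """Return the layers that no mapped project covers, in acronym order."""
    covered: Set[Layer] = set()
    for layers in project_layers.values():
        covered.update(layers)
    return [layer for layer in Layer if layer not in covered]
```

For example, mapping `{"project_a": {Layer.EXECUTION, Layer.TOOLING}, "project_b": {Layer.CONTEXT}}` leaves Lifecycle and the whole control plane uncovered, which is the kind of gap an ecosystem map is meant to surface.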
Execution environment. Determines where agent code runs and what sandbox constraints bound it: managed sandboxes, microVMs, code-specialized runtimes, computer-use environments, browser sandboxes, and OS-level permission...