Agent Harness Engineering: A Survey



Junjie Li (1,6,*), Xi Xiao (6,*), Yunbei Zhang (5,*), Chen Liu (2,*), Lin Zhao (4), Xiaoying Liao (3), Yingrui Ji (6), Janet Wang (5), Jianyang Gu (7), Yingqiang Ge (9), Weijie Xu (9), Xi Fang (9), Xiang Xu (9), Tianchen Zhao (9), Youngeun Kim (9), Tianyang Wang (6), Jihun Hamm (5), Smita Krishnaswamy (2), Jun Huan (9,†), Chandan K Reddy (8,9,†)

1 Carnegie Mellon University · 2 Yale University · 3 Johns Hopkins University · 4 Northeastern University · 5 Tulane University · 6 University of Alabama at Birmingham · 7 The Ohio State University · 8 Virginia Tech · 9 Amazon

* Equal contribution. † Corresponding authors.


A side-by-side comparison of prompt, context, and harness engineering.

Abstract

The harness is becoming the binding constraint.

The rapid deployment of large language model agents in production has revealed a recurring pattern: task execution reliability depends less on the underlying model than on the infrastructure layer that wraps it, the agent execution harness.

This survey presents agent harness engineering as an independent system layer, proposes the seven-layer ETCLOVG taxonomy (Execution, Tooling, Context, Lifecycle, Observability, Verification, Governance), and maps a broad corpus of open-source projects onto that taxonomy to expose ecosystem patterns, coverage gaps, and emerging design principles.

Contributions

Three Claims

Claim 1: Harnesses are independent system layers.

Real-world reliability is shaped by execution controls, feedback loops, governance, evaluation, and operational design, not only by model capability.

Claim 2: ETCLOVG separates production concerns.

Execution, Tooling, Context, Lifecycle, Observability, Verification, and Governance expose architectural boundaries that earlier frameworks often conflate.

Claim 3: A broad ecosystem map reveals gaps.

A systematic mapping of the open-source ecosystem surfaces adoption patterns across sandboxes, protocols, memory systems, orchestrators, observability platforms, benchmarks, and governance stacks.

Three Engineering Phases

Read across 2022–2026, agent engineering shows a coherent shift in where the marginal effort lands. The three phases overlap in time and concept; they describe what the field has chosen to engineer, not a clean sequence of replacements.

2022–2024

Prompt engineering. The primary lever is the input prompt text: instructions, few-shot examples, and reasoning templates, all optimized for a single model call.

2025

Context engineering. The question shifts from "what is the input?" to "what should the model see at each step?" The scope expands to retrieval, compaction, tool-result ranking, and managing context-window saturation across turns.

2026–

Harness engineering. As models become capable enough to attempt long-running tasks, the engineering focus expands to the full infrastructure wrapper: execution environment, tool interface, context, lifecycle, observability, verification, and governance.
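The context-engineering question of "what should the model see at each step?" can be illustrated with a small sketch. It is not taken from the survey: the `Message` type, the whitespace-based `count_tokens`, and the keep-newest policy in `fit_to_budget` are all assumptions chosen for brevity, and a real harness would use the model's tokenizer and likely a summarizing compaction step.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "system", "user", "assistant", or "tool"
    text: str


def count_tokens(text: str) -> int:
    # Assumption: a whitespace word count stands in for a real tokenizer.
    return len(text.split())


def fit_to_budget(history: list[Message], budget: int) -> list[Message]:
    """Keep system messages, then the newest other messages that still fit."""
    system = [m for m in history if m.role == "system"]
    remaining = budget - sum(count_tokens(m.text) for m in system)
    kept: list[Message] = []
    # Walk backwards so the most recent turns are considered first.
    for m in reversed([m for m in history if m.role != "system"]):
        cost = count_tokens(m.text)
        if cost > remaining:
            break  # stop at the first turn that no longer fits
        kept.append(m)
        remaining -= cost
    # Restore chronological order for the model.
    return system + list(reversed(kept))
```

Dropping the oldest turns is the simplest possible policy; ranking tool results or summarizing old turns would replace the loop body without changing the interface.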

Timeline of Agent-Harness Systems

The same shift is visible in the systems themselves. The ReAct era of 2022–2023 wrapped a single model in a while-loop, a prompt template, and a small tool dispatch table; AutoGPT and BabyAGI exposed the resulting failures, including execution runaway, context blowout, state loss, and unmonitored side effects, as infrastructure problems rather than prompt problems. Tool integration and multi-agent coordination from 2023–2024 added learned tool use (Gorilla, ToolLLM, Toolformer), role-playing organizations (CAMEL, ChatDev, MetaGPT, Mixture-of-Agents), the first agent benchmarks (SWE-bench, AgentBench, WebArena, GAIA), and the beginnings of protocol standardization (MCP, A2A). By 2025–2026 enough deployment experience had accumulated that "harness engineering" began to be named as a discipline of its own, accompanied by automated harness optimization and a wave of results in which only the harness was varied.
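To make the ReAct-era shape concrete, the following is a minimal sketch of a loop, a prompt template, and a tool dispatch table, plus a step cap as one simple guard against execution runaway. Everything here is hypothetical: the `TOOLS` table, the `PROMPT_TEMPLATE`, the `run_agent` function, and the assumed reply format are illustrative and do not reproduce any of the systems named above. The `model` argument is any callable from prompt to text, so no real model API is assumed.

```python
from typing import Callable

# Hypothetical tool dispatch table: tool name -> Python callable.
TOOLS: dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}

PROMPT_TEMPLATE = "Task: {task}\nHistory:\n{history}\nNext action:"


def run_agent(task: str, model: Callable[[str], str], max_steps: int = 5) -> str:
    """Minimal loop: prompt -> model -> tool call -> observation, with a step cap."""
    history: list[str] = []
    for _step in range(max_steps):  # the cap guards against execution runaway
        prompt = PROMPT_TEMPLATE.format(task=task, history="\n".join(history))
        reply = model(prompt).strip()
        # Assumed reply format: "final: <answer>" or "<tool> <argument>".
        if reply.startswith("final:"):
            return reply[len("final:"):].strip()
        name, _sep, arg = reply.partition(" ")
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        history.append(f"{reply} -> {observation}")
    return "stopped: step budget exhausted"
```

Note what the sketch lacks: the history grows without bound (context blowout), nothing persists between runs (state loss), and tool calls are neither logged nor checked (unmonitored side effects). Those omissions are the failure modes the paragraph above attributes to infrastructure rather than prompts.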

Representative agent-harness systems by ETCLOVG layer, 2022–2026.

The ETCLOVG Taxonomy

We organize the harness into seven layers. The first four describe the structural core of a harness; the last three describe the control plane around it. Compared with earlier six-component frameworks, Observability and Governance appear here as independent layers because, in production deployments, each has its own tooling stack and is owned by a different team.

The ETCLOVG taxonomy. E, T, C, and L form the structural pillars; O provides system-wide monitoring; V delivers evaluation and feedback; G enforces governance constraints across the system.
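As a rough illustration of how the taxonomy could be used to map projects and find coverage gaps, the seven layers can be written as an enumeration. This is a sketch under stated assumptions, not the survey's tooling: the `Layer` enum, the `STRUCTURAL` and `CONTROL_PLANE` groupings, the `coverage_gaps` helper, and the placeholder project names in the usage note are all hypothetical.

```python
from enum import Enum


class Layer(Enum):
    EXECUTION = "E"
    TOOLING = "T"
    CONTEXT = "C"
    LIFECYCLE = "L"
    OBSERVABILITY = "O"
    VERIFICATION = "V"
    GOVERNANCE = "G"


# Structural core versus the control plane around it, as described above.
STRUCTURAL = {Layer.EXECUTION, Layer.TOOLING, Layer.CONTEXT, Layer.LIFECYCLE}
CONTROL_PLANE = {Layer.OBSERVABILITY, Layer.VERIFICATION, Layer.GOVERNANCE}


def coverage_gaps(project_layers: dict[str, set[Layer]]) -> set[Layer]:
    """Return the layers that no project in the mapping covers."""
    covered: set[Layer] = set()
    for layers in project_layers.values():
        covered |= layers
    return set(Layer) - covered
```

For example, with placeholder entries `{"project_a": {Layer.EXECUTION, Layer.TOOLING}, "project_b": {Layer.CONTEXT, Layer.OBSERVABILITY}}`, the helper reports Lifecycle, Verification, and Governance as uncovered.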

Execution environment. Determines where agent code runs and what sandbox constraints bound it: managed sandboxes, microVMs, code-specialized runtimes, computer-use environments, browser sandboxes, and OS-level permission...

