Agent Harness Engineering: A Survey

Junjie Li1,6,*, Xi Xiao6,*, Yunbei Zhang5,*, Chen Liu2,*, Lin Zhao4, Xiaoying Liao3, Yingrui Ji6, Janet Wang5, Jianyang Gu7, Yingqiang Ge9, Weijie Xu9, Xi Fang9, Xiang Xu9, Tianchen Zhao9, Youngeun Kim9, Tianyang Wang6, Jihun Hamm5, Smita Krishnaswamy2, Jun Huan9,†, Chandan K Reddy8,9,†

1Carnegie Mellon University · 2Yale University · 3Johns Hopkins University · 4Northeastern University · 5Tulane University · 6University of Alabama at Birmingham · 7The Ohio State University · 8Virginia Tech · 9Amazon

*Equal contribution. †Corresponding authors.

A side-by-side comparison of prompt, context, and harness engineering.

Abstract

The harness is becoming the binding constraint.

The rapid deployment of large language model agents in production has revealed a recurring pattern: task execution reliability depends less on the underlying model than on the infrastructure layer that wraps it, the agent execution harness.

This survey presents agent harness engineering as an independent system layer, proposes the seven-layer ETCLOVG taxonomy (Execution, Tooling, Context, Lifecycle, Observability, Verification, Governance), and maps a broad corpus of open-source projects onto that taxonomy to expose ecosystem patterns, coverage gaps, and emerging design principles.

Contributions

Three Claims

Claim 1. Harnesses are independent system layers.

Real-world reliability is shaped by execution controls, feedback loops, governance, evaluation, and operational design, not only by model capability.

Claim 2. ETCLOVG separates production concerns.

Execution, Tooling, Context, Lifecycle, Observability, Verification, and Governance expose architectural boundaries that earlier frameworks often conflate.

Claim 3. A broad ecosystem map reveals gaps.

A systematic mapping of the open-source ecosystem surfaces adoption patterns across sandboxes, protocols, memory systems, orchestrators, observability platforms, benchmarks, and governance stacks.

Three Engineering Phases

Read across 2022–2026, agent engineering has gone through a coherent shift in where the marginal effort lands. The three phases overlap in time and concept; they describe what the field has chosen to engineer, not a clean sequence of replacements.

2022–2024

Prompt engineering. The primary lever is the input prompt text: instructions, few-shot examples, and reasoning templates, all optimized for a single model call.

2025

Context engineering. The question shifts from “what is the input?” to “what should the model see at each step?” The scope expands to retrieval, compaction, tool-result ranking, and managing context-window saturation across turns.

2026–

Harness engineering. As models become capable enough to attempt long-running tasks, the engineering focus expands to the full infrastructure wrapper: execution environment, tool interface, context, lifecycle, observability, verification, and governance.

Timeline of Agent-Harness Systems

The same shift is visible in the systems themselves. The ReAct era of 2022–2023 wrapped a single model in a while-loop with a prompt template and a small tool dispatch table; AutoGPT and BabyAGI exposed the resulting failures, including execution runaway, context blowout, state loss, and unmonitored side effects, as infrastructure problems rather than prompt problems. Tool integration and multi-agent coordination in 2023–2024 added learned tool use (Gorilla, ToolLLM, Toolformer), role-playing organizations (CAMEL, ChatDev, MetaGPT, Mixture-of-Agents), the first agent benchmarks (SWE-bench, AgentBench, WebArena, GAIA), and the beginnings of protocol standardization (MCP, A2A). By 2025–2026 enough deployment experience had accumulated that “harness engineering” began to be named as a discipline of its own, accompanied by automated harness optimization and a wave of results in which only the harness was varied.
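To make the ReAct-era pattern concrete, the following is a minimal sketch of a loop, a prompt template, and a tool dispatch table. It is an illustration only, not code from any surveyed system: `TOOLS`, `PROMPT_TEMPLATE`, `scripted_model`, and `run_loop` are hypothetical names, and the model is a scripted stand-in rather than a real model API. The `max_steps` budget is the simplest possible guard against execution runaway, while the ever-growing `history` list shows where context blowout would originate.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool dispatch table: tool name -> function of one string argument.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split())),
    "echo": lambda arg: arg,
}

# Hypothetical prompt template; the whole history is re-sent on every step.
PROMPT_TEMPLATE = "Task: {task}\nHistory:\n{history}\nNext action:"


def scripted_model(prompt: str) -> str:
    """Stand-in for a model call (an assumption of this sketch, not a real API).

    Returns an action of the form "tool_name: argument" or "finish: answer".
    """
    if "Observation: 5" in prompt:
        return "finish: 5"
    return "add: 2 3"


def run_loop(
    task: str,
    model: Callable[[str], str],
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Run the act/observe loop until the model finishes or the budget runs out."""
    history: List[str] = []
    for _ in range(max_steps):  # step budget: a crude guard against runaway
        prompt = PROMPT_TEMPLATE.format(task=task, history="\n".join(history))
        action = model(prompt)
        name, _, arg = action.partition(":")
        name, arg = name.strip(), arg.strip()
        if name == "finish":
            return arg, history
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        # History grows without bound here; nothing compacts or truncates it.
        history.append(f"Action: {action}")
        history.append(f"Observation: {observation}")
    return "step budget exhausted", history


answer, trace = run_loop("What is 2 + 3?", scripted_model)
```

Everything a harness adds beyond this sketch (sandboxing the tool calls, persisting `history`, logging each step, verifying the final answer) corresponds to the failure modes listed above.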

Representative agent-harness systems by ETCLOVG layer, 2022–2026.

The ETCLOVG Taxonomy

We organize the harness into seven layers. The first four describe the structural core of a harness; the last three describe the control plane around it. Compared with earlier six-component frameworks, Observability and Governance appear here as independent layers because, in production deployments, each has its own tooling stack and is owned by a different team.

The ETCLOVG taxonomy. E, T, C, and L form the structural pillars; O provides system-wide monitoring; V delivers evaluation and feedback; G enforces governance constraints across the system.
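One way to picture the mapping of projects onto the taxonomy is as a small data structure. The sketch below is an assumption-laden illustration, not the survey's tooling: the `Layer` enum, the `STRUCTURAL` and `CONTROL_PLANE` groupings, and `layer_coverage` are hypothetical names, and the project labels are placeholders rather than real projects or real classifications.

```python
from enum import Enum
from typing import Dict, Set


class Layer(Enum):
    """The seven ETCLOVG layers, in acronym order."""
    EXECUTION = "E"
    TOOLING = "T"
    CONTEXT = "C"
    LIFECYCLE = "L"
    OBSERVABILITY = "O"
    VERIFICATION = "V"
    GOVERNANCE = "G"


# First four layers: structural core. Last three: control plane around it.
STRUCTURAL = (Layer.EXECUTION, Layer.TOOLING, Layer.CONTEXT, Layer.LIFECYCLE)
CONTROL_PLANE = (Layer.OBSERVABILITY, Layer.VERIFICATION, Layer.GOVERNANCE)


def layer_coverage(projects: Dict[str, Set[Layer]]) -> Dict[Layer, int]:
    """Count how many projects touch each layer; zero counts mark coverage gaps."""
    counts = {layer: 0 for layer in Layer}
    for layers in projects.values():
        for layer in layers:
            counts[layer] += 1
    return counts


# Placeholder projects (hypothetical, for illustration only).
example_projects = {
    "project_a": {Layer.EXECUTION, Layer.TOOLING},
    "project_b": {Layer.CONTEXT, Layer.OBSERVABILITY},
    "project_c": {Layer.EXECUTION, Layer.VERIFICATION},
}

coverage = layer_coverage(example_projects)
gaps = [layer for layer, n in coverage.items() if n == 0]
acronym = "".join(layer.value for layer in Layer)
```

With these placeholder inputs, `gaps` contains the Lifecycle and Governance layers, which is the kind of signal an ecosystem map is meant to surface.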

Execution environment. Determines where agent code runs and what sandbox constraints bound it: managed sandboxes, microVMs, code-specialized runtimes, computer-use environments, browser sandboxes, and OS-level permission...
