Code as Agent Harness

Preprint 2026 · Survey

Toward Executable, Verifiable, and Stateful Agent Systems

A code-centered view of agentic AI: code is not only generated output, but the operational substrate for reasoning, acting, environment modeling, execution feedback, and multi-agent coordination.



Xuying Ning1†, Katherine Tieu1†, Dongqi Fu2†, Tianxin Wei1†, Zihao Li1†, Yuanchen Bei1†, Jiaru Zou3, Mengting Ai1, Zhining Liu1, Ting-Wei Li1, Lingjie Chen1, Yanjun Zhao1, Ke Yang1, Bingxuan Li1, Cheng Qian1, Gaotang Li1, Xiao Lin1, Zhichen Zeng1, Ruizhong Qiu1, Sirui Chen1, Yifan Sun1, Xiyuan Yang1, Ruida Wang1, Rui Pan1, Chenyuan Yang1, Dylan Zhang1, Liri Fang1, Zikun Cui2, Yang Cao2, Pan Chen2, Dorothy Sun2, Ren Chen2, Mahesh Srinivasan2, Nipun Mathur2, Yinglong Xia2, Hong Li2, Hong Yan2, Pan Lu3, Lingming Zhang1, Tong Zhang1, Hanghang Tong1§, Jingrui He1§

1University of Illinois Urbana-Champaign · 2Meta · 3Stanford University · †Core Contributor · §Corresponding Author

3 Connected Layers

6+ Application Areas

102 PDF Pages

450+ Cited Works

Abstract

Code becomes the runtime medium for agents.

Recent LLMs have become strong code generators, but emerging agentic systems use code for more than final answers. This survey frames code as an agent harness: a unified infrastructure layer for agent reasoning, action, environment modeling, feedback-driven control, and verification. It studies how code connects agents to executable steps, durable state, reusable tools, tests, traces, repositories, and multi-agent workflows.

Taxonomy

Three Layers of Code as Harness

01 Harness Interface

Code connects agents to reasoning, action, and environment modeling: executable reasoning traces, programmable actions, DOM/API interfaces, simulators, tests, and state representations.

Reasoning substrate

Action interface

Environment representation
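As an illustration of the action interface and environment representation described above, here is a minimal Python sketch. The `Harness` class, the `write_file` and `read_file` tools, and the dictionary-based state are all hypothetical names chosen for this example; they are not taken from the survey. The sketch only shows the general idea that actions are ordinary function calls against explicit state, and that every call leaves a trace entry.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Harness:
    """Hypothetical harness: tools are callables, the environment is a plain dict."""
    state: Dict[str, Any] = field(default_factory=dict)
    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    trace: List[dict] = field(default_factory=list)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def act(self, name: str, **kwargs: Any) -> dict:
        """Run one action against the state and record the outcome in the trace."""
        try:
            result = self.tools[name](self.state, **kwargs)
            step = {"tool": name, "args": kwargs, "ok": True, "result": result}
        except Exception as exc:  # failures are recorded, not raised
            step = {"tool": name, "args": kwargs, "ok": False, "result": repr(exc)}
        self.trace.append(step)
        return step


# Two illustrative tools operating on an in-memory "file system".
def write_file(state: Dict[str, Any], path: str, text: str) -> int:
    state.setdefault("files", {})[path] = text
    return len(text)


def read_file(state: Dict[str, Any], path: str) -> str:
    return state["files"][path]


harness = Harness()
harness.register("write_file", write_file)
harness.register("read_file", read_file)

first = harness.act("write_file", path="notes.txt", text="hello")
second = harness.act("read_file", path="missing.txt")  # fails, but is traced
```

Keeping failures inside the trace rather than raising them is one possible design choice; it lets a later step inspect what went wrong.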

02 Harness Mechanisms

Planning, memory, tool use, control, and optimization sustain agents over long-horizon execution. Failures become feedback for repair rather than dead ends.

Planning and decomposition

Working and long-term memory

Tests, traces, and static analysis
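The idea that failures become feedback for repair can be sketched as a small loop. This is a toy illustration under stated assumptions, not the survey's method: `stub_propose` stands in for a model call, `run_tests` is a hypothetical single-test checker, and the target function `add` is invented for the example.

```python
from typing import Callable, List, Optional, Tuple


def run_tests(source: str) -> Optional[str]:
    """Execute candidate source; return None on success, else an error message."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def repair_loop(
    propose: Callable[[Optional[str]], str], max_attempts: int = 3
) -> Tuple[Optional[str], List[str]]:
    """Ask for a candidate, test it, and pass any failure message to the next attempt."""
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_attempts):
        source = propose(feedback)
        feedback = run_tests(source)
        history.append(feedback or "pass")
        if feedback is None:
            return source, history
    return None, history


def stub_propose(feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a model: buggy first, corrected once it sees feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


solution, history = repair_loop(stub_propose)
```

In a real system the proposer would condition on the full error text and earlier attempts; the stub only checks whether any feedback exists.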

03 Scaling the Harness

Shared code artifacts allow multiple agents to coordinate, review, test, debate, red-team, and verify progress inside a common repository or workflow state.

Manager, planner, coder, reviewer, tester roles

Centralized and distributed workflows

Shared state and collective verification

Applications

Where the harness shows up

Coding Assistants · GUI / OS Agents · Embodied Agents · Scientific Discovery · Personalization · Recommendation · DevOps · Enterprise Workflows

Open Problems

Harness engineering is the hard part.

Evaluation beyond final success

Intermediate states, traces, repair attempts, and safety checks need first-class metrics.

Verification with incomplete feedback

Agents must act under partial tests, noisy execution signals, and hidden environment state.

Regression-free improvement

Harnesses should learn from failure without silently breaking previously working behavior.

Shared state across agents

Coordination depends on durable memory, repository state, review artifacts, and permissions.
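To make the regression-free improvement problem concrete, here is a small Python sketch of an acceptance gate. It assumes behavior can be checked against a fixed set of named cases; the function names, the cases, and the square-root example are all hypothetical. An update is accepted only if it fixes at least one case and breaks none that previously passed.

```python
from typing import Any, Callable, Dict, Set, Tuple

Cases = Dict[str, Tuple[tuple, Any]]  # case name -> (arguments, expected result)


def passing_cases(fn: Callable[..., Any], cases: Cases) -> Set[str]:
    """Return the names of the cases that fn currently satisfies."""
    passed = set()
    for name, (args, expected) in cases.items():
        try:
            if fn(*args) == expected:
                passed.add(name)
        except Exception:
            pass  # an exception counts as a failing case
    return passed


def accept_update(
    old_fn: Callable[..., Any], new_fn: Callable[..., Any], cases: Cases
) -> Tuple[bool, Set[str], Set[str]]:
    """Accept only if nothing that passed before now fails and something new passes."""
    before = passing_cases(old_fn, cases)
    after = passing_cases(new_fn, cases)
    regressions = before - after
    gains = after - before
    return (not regressions and bool(gains)), regressions, gains


cases: Cases = {
    "positive": ((4,), 2.0),
    "zero": ((0,), 0.0),
    "negative": ((-4,), None),  # desired behavior: return None for negative input
}


def old_sqrt(x):
    return x ** 0.5  # does not return None for negative input


def bad_update(x):
    return None if x <= 0 else x ** 0.5  # fixes "negative" but breaks "zero"


def good_update(x):
    return None if x < 0 else x ** 0.5


bad_ok, bad_regressions, bad_gains = accept_update(old_sqrt, bad_update, cases)
good_ok, good_regressions, good_gains = accept_update(old_sqrt, good_update, cases)
```

The gate only guards cases it knows about, which ties this problem to the incomplete-feedback problem listed above.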

Citation

BibTeX

@misc{ning2026codeagentharness,
  title         = {Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems},
  author        = {Xuying Ning and Katherine Tieu and Dongqi Fu and Tianxin Wei and Zihao Li and Yuanchen Bei and Jiaru Zou and Mengting Ai and Zhining Liu and Ting-Wei Li and Lingjie Chen and Yanjun Zhao and Ke Yang and Bingxuan Li and Cheng Qian and Gaotang Li and Xiao Lin and Zhichen Zeng and Ruizhong Qiu and Sirui Chen and Yifan Sun and Xiyuan Yang and Ruida Wang and Rui Pan and Chenyuan Yang and Dylan Zhang and Liri Fang and Zikun Cui and Yang Cao and Pan Chen and Dorothy Sun and Ren Chen and Mahesh Srinivasan and Nipun Mathur and Yinglong Xia and Hong Li and Hong Yan and Pan Lu and Lingming Zhang and Tong Zhang and Hanghang Tong and Jingrui He},
  year          = {2026},
  eprint        = {2605.18747},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.18747},
}
