Autonomous Long-Running Coding Agents - by elvis
AI Newsletter
Autonomous Long-Running Coding Agents<br>What is the big deal with loop engineering and autonomous long-running agents?
elvis<br>Jun 15, 2026
Autonomous coding is moving from better prompting to better control systems. The important shift is that engineers are learning how to wrap agents in goals, evaluators, loops, and artifacts that let them keep working after the human stops typing.<br>This matters because most serious engineering work spans long horizons: ambiguous requirements, hidden constraints, partial failures, changing context, and repeated verification. The new frontier is designing the system around the agent so it can plan, execute, check its work, recover from mistakes, and keep making progress without constant human steering.<br>This piece is based on a DAIR.AI Academy session on autonomous long-running coding agents, where I walked through Claude Code’s /goal mode, the newer /loop command, verifiers, artifacts, and orchestration patterns in practice. Written in collaboration with Codex and Claude Code.<br>From Prompting to Goal Design
The core idea behind features like Claude Code’s /goal is simple. A coding agent remains the executor, but the human no longer interacts with it turn by turn. Instead, the human specifies the desired end state, the evidence required to prove success, the constraints that must not be violated, and, where possible, the number of turns and budget.<br>That goal works more like a contract than a longer prompt. A weak goal gives the model room to stop early, take shortcuts, or redefine success in a way that looks plausible in the transcript but fails in the real system. A strong goal gives the agent a target it can repeatedly measure itself against.<br>Engineering judgment still matters here. The best goals encode domain knowledge that the model would otherwise guess. For a research experiment, that might mean a target benchmark score, a held-out evaluation, a required loss curve, and a rule that the result must beat an initial baseline. For a UI task, it might mean a screenshot reference, concrete layout constraints, and a browser verification step. The model can execute, but the human still defines what “done” actually means.<br>The Evaluator Becomes a First-Class Component
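A goal written as a contract implies that something has to measure the agent against it on every iteration. As a rough illustration of that split, the sketch below writes the goal fields named earlier (end state, required evidence, constraints, turn budget) as plain data and runs a loop that stops only when an external check reports that every piece of evidence holds. The names `Goal`, `RunResult`, `run_until_done`, `step`, and `check` are hypothetical and chosen for illustration; this is not how Claude Code's /goal represents goals internally.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Goal:
    """Hypothetical goal contract; field names are illustrative only."""
    end_state: str                 # what "done" means, in plain language
    evidence: Tuple[str, ...]      # named checks that must all pass
    constraints: Tuple[str, ...]   # rules the agent must not violate
    max_turns: int                 # turn budget for the whole run


@dataclass(frozen=True)
class RunResult:
    done: bool
    turns_used: int
    failing: Tuple[str, ...]       # evidence still missing when the loop stopped


def run_until_done(
    goal: Goal,
    step: Callable[[Goal, Tuple[str, ...]], None],
    check: Callable[[str], bool],
) -> RunResult:
    """Alternate agent steps and external checks until all evidence holds
    or the turn budget is exhausted. Constraints are carried as data here;
    enforcing them is left to the checks in this sketch."""
    failing = tuple(name for name in goal.evidence if not check(name))
    turns = 0
    while failing and turns < goal.max_turns:
        step(goal, failing)        # the agent works on whatever still fails
        turns += 1
        failing = tuple(name for name in goal.evidence if not check(name))
    return RunResult(done=not failing, turns_used=turns, failing=failing)


# Stand-in "agent" for demonstration: it fixes one failing item per turn.
demo_state = {"tests_pass": False, "beats_baseline": False}


def demo_step(goal: Goal, failing: Tuple[str, ...]) -> None:
    demo_state[failing[0]] = True


demo_goal = Goal(
    end_state="model beats the initial baseline with all tests green",
    evidence=("tests_pass", "beats_baseline"),
    constraints=("do not edit the test files",),
    max_turns=5,
)
demo_result = run_until_done(demo_goal, demo_step, lambda name: demo_state[name])
print(demo_result)
```

The point of the design is that `done` is never set by the agent's own report. It is derived from `check`, which is exactly the role the rest of this section assigns to the evaluator.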
Long-running agents need a second component besides the goal: an evaluator that judges whether the goal has been met. That evaluator can be another coding agent, an LLM-as-judge, a script, a test suite, a benchmark harness, or a mix of them. The key design choice is matching the evaluator to the task. When success is crisp, deterministic checks are the better choice. Type checks, unit tests, lint rules, integration tests, and benchmark scripts should be used whenever they can express the condition clearly.<br>When success is fuzzy, an agent evaluator becomes useful. A script can tell you whether tests pass, but it cannot easily decide whether a generated research report is coherent, whether an implementation faithfully follows a paper, or whether a UI matches the design intent. This is where the evaluator benefits from language, judgment, and sometimes vision.<br>The practical pattern uses deterministic checks as the floor and agent evaluation as the higher-level review. That combination reduces hallucinated success while still allowing autonomy on tasks that do not fit cleanly into a test assertion.<br>Verifiers Define the Boundary of Trust
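One way to picture layered verification, with deterministic checks as the floor and a judgment-based review on top, is the minimal sketch below. Everything in it is an assumption made for illustration: `Verdict`, `evaluate`, and the stub checks are hypothetical names, and the reviewer is a plain function standing in for what might be an LLM judge or a second agent in practice.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Both layers share one shape: take the artifact, return (ok, message).
Check = Callable[[str], Tuple[bool, str]]


@dataclass(frozen=True)
class Verdict:
    passed: bool
    stage: str    # "deterministic" or "review": where the decision was made
    reason: str


def evaluate(artifact: str, checks: Sequence[Check], reviewer: Check) -> Verdict:
    """Run cheap deterministic checks first; consult the reviewer only if
    every one of them passes."""
    for check in checks:
        ok, message = check(artifact)
        if not ok:
            # The floor failed, so the higher-level review is never asked.
            return Verdict(False, "deterministic", message)
    ok, message = reviewer(artifact)
    return Verdict(ok, "review", message)


# Stub checks for demonstration; real ones might wrap a test run or a linter.
def not_empty(artifact: str) -> Tuple[bool, str]:
    return (bool(artifact.strip()), "artifact is empty")


def has_results_section(artifact: str) -> Tuple[bool, str]:
    return ("Results" in artifact, "missing a Results section")


# Stub reviewer standing in for a judgment call (hypothetical rule).
def stub_reviewer(artifact: str) -> Tuple[bool, str]:
    return ("baseline" in artifact, "report never compares against a baseline")


report = "Results: accuracy improved over the baseline."
verdict = evaluate(report, [not_empty, has_results_section], stub_reviewer)
print(verdict)
```

Ordering the layers this way keeps the expensive, fuzzier review from being spent on work that fails a basic check, and it records which layer made the decision so a failure can be traced to the right kind of verifier.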
The deeper point is that autonomy only works when the system has a reliable verifier. A coding agent can generate a plan, implement a feature, and explain why it believes the work is complete, but that explanation should not be treated as evidence. Evidence comes from an external check that the agent cannot easily talk its way around.<br>For code, the verifier might be a test suite, type checker, benchmark, browser run, screenshot comparison, or reproducible script. For research work, it might be a held-out evaluation, a reproduced table, a loss curve, or a benchmark score that improves over the baseline. For design work, it might be a reference screenshot plus a visual review step. The verifier is what turns a long-running agent from a confident text generator into a system that can be trusted with more time.<br>Most shortcuts appear at this boundary. If the verifier is vague, the model will often satisfy the easiest interpretation of the task. If the verifier is too narrow, the model may overfit to it and miss the broader intent. A good autonomous workflow therefore needs layered verification, with cheap deterministic checks catching basic failures and higher-level review catching judgment-heavy failures. A few frontier models can already perform some of this verification themselves, but based on my research there is still an evident out-of-distribution (OOD) problem: when the verification task you assign to the agent falls outside the training distribution, models struggle significantly.<br>Verifiers are still an open area of research, but I anticipate more companies will start to...