Does the Harness Matter? Lessons from ALE-Claw on Agents’ Last Exam<br>Analysis · June 11, 2026 · 11 min read<br>A strong agent needs a harness. The question is how much harness is enough.<br>By Yixiao Huang and Yiyou Sun
TL;DR<br>On Agents' Last Exam (ALE), model choice moves the score more than harness choice: under a fixed OpenClaw harness, the model sweep spans 18.0 percentage points, while fixed-model harness sweeps span only 5 to 6 points.<br>We built ALE-Claw, a deliberately small computer-use harness derived from OpenClaw. It removes product-assistant machinery while keeping the core agent loop, and reaches the same accuracy band with 44% fewer input tokens, 41% lower cost, and 60% less wall-clock time than OpenClaw.<br>A richer harness is not automatically a better one. Across the five GPT-5.5 harnesses, neither a bigger tool surface nor a heavier product layer came with a higher score.
1. The Debate<br>Since ReAct was introduced, agent builders have made harnesses steadily more elaborate. The basic loop is still the same: build context, call the model, dispatch a tool, observe the result, compact or prune context, and repeat until the agent submits a final answer.<br>Around that loop, production systems add a lot: memory, skills, planning tools, sub-agents, and user preferences. Some of those features clearly matter in interactive products. If an assistant is supposed to work with a human over weeks, remember preferences, ask clarifying questions, and recover from ambiguous instructions, a thin benchmark loop is not enough.<br>The recent Terminal-Bench writeups make this point forcefully. KRAFTON's Terminus-KIRA post argues that a very minimal terminal harness left frontier models with avoidable failure modes. ForgeCode's “Benchmarks Don't Matter — Until They Do” post tells a similar story from the other direction: the same model moved from weak benchmark performance to state of the art after the runtime was made non-interactive, faster, and stricter about tools and planning.<br>So the natural takeaway is: harnesses matter a lot.<br>There is also a cautionary version of the same story. A recent DebugML audit found cheating or reward hacking across 28+ submissions and 9 benchmarks, including Terminal-Bench scaffolds that exposed verifier files or injected non-official AGENTS.md answer keys into the agent context. In one audited case, replacing tainted ForgeCode traces with clean-scaffold runs on the same model dropped the pass rate from 81.8% to 71.7%. That raises the question: did the scaffold genuinely make the agent better, or did it just cheat or overfit to the benchmark?<br>ALE gives us a chance to test how far the claim that harnesses matter a lot travels in a different setting. Terminal-Bench is centered on terminal tasks, where a task- or domain-specific harness can be a large advantage.
ALE instead asks agents to handle long-running professional work that can last for hours and span many industries, which pushes the harness toward a more general computer-use interface.<br>In particular, we are asking:<br>If the model is strong, how much does the harness still move the score?
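For concreteness, the basic loop described at the start of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from ALE-Claw, OpenClaw, or any other harness: every name here (`Step`, `compact`, `run_agent`, `scripted_model`, the `echo` tool) is hypothetical, the model is a scripted stand-in rather than a real API call, and compaction is reduced to keeping the task plus the most recent observations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


# Hypothetical types and helpers, invented for this illustration only.
@dataclass
class Step:
    tool: str       # tool name, or "submit" to end the run
    argument: str   # one string argument, to keep the sketch small


def compact(history: List[str], max_items: int) -> List[str]:
    """Keep the task line (first item) plus the most recent entries."""
    if len(history) <= max_items:
        return history
    keep = max_items - 1
    tail = history[len(history) - keep:] if keep > 0 else []
    return [history[0]] + tail


def run_agent(
    task: str,
    model: Callable[[str], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
    max_items: int = 6,
) -> Optional[str]:
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        context = "\n".join(history)              # build context
        step = model(context)                     # call the model
        if step.tool == "submit":                 # final answer
            return step.argument
        handler = tools.get(step.tool)
        if handler is None:
            observation = f"error: unknown tool {step.tool!r}"
        else:
            observation = handler(step.argument)  # dispatch the tool
        # Observe the result, then compact the running context.
        history.append(f"{step.tool}({step.argument}) -> {observation}")
        history = compact(history, max_items)
    return None  # step budget exhausted without a submission


def scripted_model(context: str) -> Step:
    """Stand-in for a real model: call one tool, then submit its output."""
    if "-> " not in context:
        return Step("echo", "4")
    return Step("submit", context.rsplit("-> ", 1)[1])


tools = {"echo": lambda text: text}
answer = run_agent("What is 2 + 2?", scripted_model, tools)
print(answer)
```

Everything a production harness adds (memory, skills, planning tools, sub-agents, preferences) sits around a loop of roughly this shape, which is what makes it possible to ask how much of that surrounding layer the score actually depends on.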
2. ALE and the Shared GCUA Harness<br>Agents' Last Exam evaluates Generalist Computer-Use Agents (GCUAs): agents that can operate across shell, files, GUI applications, and web research rather than only inside one terminal workflow.<br>ALE is designed to measure sustained performance on long-horizon, economically valuable, real-world work with verifiable outcomes. Developed with industry experts and grounded in the O*NET-SOC taxonomy, the public benchmark spans around 150 tasks across 55 subfields and 13 industry clusters. Many runs can last for hours, so ALE stresses whether a harness can support broad professional computer work without being tuned to one narrow task family.<br>We evaluate ALE across multiple frontier models and agent harnesses, and the results show that the benchmark is far from saturated. The strongest configuration reported, Codex with GPT-5.5, is below 50% full-pass on the easiest tier and below 10% on the hardest. The average full-pass rate on the hardest tier is 2.6%.<br>The shared GCUA harness has to be general enough to support that range. In ALE, the common starting point is this structure:
Shared GCUA harness architecture: a main loop around a system-prompt builder, tool system, and context manager. Reproduced from Figure 5 in our paper.<br>Main loop. Calls the model, dispatches actions, observes results, and repeats.<br>Prompt builder. Assembles task instructions, runtime metadata, tool guidance, and behavioral rules.<br>Tool system. Exposes shell, files, web, GUI actions, and sometimes background processes or sub-agent delegation.<br>Context manager. Keeps long trajectories inside the model context window.<br>This shared core is the starting point. The question is whether the product layer built around it, including memory, skills, preferences, and...