Does the Harness Matter? Lessons from ALE-Claw on Agents' Last Exam

Analysis · June 11, 2026 · 11 min read

A strong agent needs a harness. The question is how much harness is enough.

By Yixiao Huang and Yiyou Sun

TL;DR

On Agents' Last Exam (ALE), model choice moves the score more than harness choice: under a fixed OpenClaw harness, the model sweep spans 18.0 percentage points, while fixed-model harness sweeps span only 5 to 6 points.

We built ALE-Claw, a deliberately small computer-use harness derived from OpenClaw. It removes product-assistant machinery while keeping the core agent loop, and reaches the same accuracy band with 44% fewer input tokens, 41% lower cost, and 60% less wall-clock time than OpenClaw.

A richer harness is not automatically a better one. Across the five GPT-5.5 harnesses, neither a bigger tool surface nor a heavier product layer came with a higher score.

1. The Debate

Since the launch of ReAct, agent builders have spent the last few years making harnesses more elaborate. The basic loop is still the same: build context, call the model, dispatch a tool, observe the result, compact or prune context, and repeat until the agent submits a final answer. Around that loop, production systems add a lot: memory, skills, planning tools, sub-agents, and user preferences.

Some of those features clearly matter in interactive products. If an assistant is supposed to work with a human over weeks, remember preferences, ask clarifying questions, and recover from ambiguous instructions, a thin benchmark loop is not enough.

The recent Terminal-Bench writeups make this point forcefully. KRAFTON's Terminus-KIRA post argues that a very minimal terminal harness left frontier models with avoidable failure modes. ForgeCode's “Benchmarks Don't Matter — Until They Do” post tells a similar story from the other direction: the same model moved from weak benchmark performance to state of the art after the runtime was made non-interactive, faster, and stricter about tools and planning. So the natural takeaway is: harnesses matter a lot.

There is also a cautionary version of the same story. A recent DebugML audit found cheating or reward hacking across 28+ submissions and 9 benchmarks, including Terminal-Bench scaffolds that exposed verifier files or injected non-official AGENTS.md answer keys into the agent context. In one audited case, replacing tainted ForgeCode traces with clean-scaffold runs on the same model dropped pass rate from 81.8% to 71.7%. This leads one to ask: did the scaffold genuinely make the agent better, or did it just cheat or overfit to the benchmark?

ALE gives us a chance to test how far that claim travels in a different setting. Terminal-Bench is centered on terminal tasks, where a task- or domain-specific harness can be a large advantage. ALE instead asks agents to handle long-running professional work that can last for hours and span many industries, which pushes the harness toward a more general computer-use interface. In particular, we are asking: If the model is strong, how much does the harness still move the score?
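For readers who have not written one, the basic loop described at the start of this section (build context, call the model, dispatch a tool, observe, compact, repeat) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of ALE-Claw, OpenClaw, or any other harness named here: `AgentLoop`, `Action`, `scripted_model`, and the `echo` tool are hypothetical names, the "model" is a scripted stub, and context compaction is reduced to keeping the most recent entries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    """One model decision: call a tool, or submit a final answer."""
    tool: Optional[str] = None   # tool name, or None when submitting
    argument: str = ""           # tool input, or the final answer text


@dataclass
class AgentLoop:
    # All fields are hypothetical stand-ins, not a real harness API.
    model: Callable[[List[str]], Action]      # maps context to an action
    tools: Dict[str, Callable[[str], str]]    # tool name -> tool function
    max_context_items: int = 8                # crude stand-in for a token budget
    max_steps: int = 20                       # hard stop for runaway loops
    history: List[str] = field(default_factory=list)

    def build_context(self, task: str) -> List[str]:
        # Build context: task instructions plus the retained trajectory.
        return [f"TASK: {task}"] + self.history

    def compact(self) -> None:
        # Naive pruning: keep only the most recent entries.
        if len(self.history) > self.max_context_items:
            self.history = self.history[-self.max_context_items:]

    def run(self, task: str) -> Optional[str]:
        for _ in range(self.max_steps):
            action = self.model(self.build_context(task))   # call the model
            if action.tool is None:
                return action.argument                      # final answer
            tool = self.tools.get(action.tool)              # dispatch a tool
            if tool is None:
                observation = f"error: unknown tool {action.tool!r}"
            else:
                observation = tool(action.argument)
            # Observe the result, then compact before the next step.
            self.history.append(
                f"{action.tool}({action.argument}) -> {observation}"
            )
            self.compact()
        return None  # step budget exhausted without a final answer


def scripted_model(context: List[str]) -> Action:
    """Toy stand-in for a model: call `echo` once, then submit."""
    if len(context) == 1:
        return Action(tool="echo", argument="hello")
    return Action(tool=None, argument="done after 1 tool call")


loop = AgentLoop(model=scripted_model, tools={"echo": lambda s: s.upper()})
result = loop.run("demo task")
```

Everything a production harness adds (memory, skills, planning tools, sub-agents, user preferences) sits around a loop of roughly this shape, which is why the question of how much of that surrounding layer affects the score can be asked at all.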

2. ALE and the Shared GCUA Harness

Agents' Last Exam evaluates Generalist Computer-Use Agents (GCUAs): agents that can operate across shell, files, GUI applications, and web research rather than only inside one terminal workflow. ALE is designed to measure sustained performance on long-horizon, economically valuable, real-world work with verifiable outcomes. Developed with industry experts and grounded in the O*NET-SOC taxonomy, the public benchmark spans around 150 tasks across 55 subfields and 13 industry clusters. Many runs can last for hours, so ALE stresses whether a harness can support broad professional computer work without being tuned to one narrow task family.

We evaluate ALE across multiple frontier models and agent harnesses, and the results show that the benchmark is far from saturated. The strongest configuration reported, Codex with GPT-5.5, is below 50% full-pass on the easiest tier and below 10% on the hardest. The average full-pass rate on the hardest tier is 2.6%.

The shared GCUA harness has to be general enough to support that range. In ALE, the common starting point is this structure:

[Figure: Shared GCUA harness architecture: a main loop around a system-prompt builder, tool system, and context manager. Reproduced from Figure 5 in our paper.]

Main loop. Calls the model, dispatches actions, observes results, and repeats.

Prompt builder. Assembles task instructions, runtime metadata, tool guidance, and behavioral rules.

Tool system. Exposes shell, files, web, GUI actions, and sometimes background processes or sub-agent delegation.

Context manager. Keeps long trajectories inside the model context window.

This shared core is the starting point. The question is whether the product layer built around it, including memory, skills, preferences, and...
