Zenith: frontier performance without Fable
29 Jun 2026
TL;DR
Bottom line up front
01 Zenith is our long-running agent harness. Point it at a hard, long-running engineering task, and it builds a harness tailored to that task, runs the work, and tests it to the finish.
02 On Frontier SWE, seventeen of the hardest public long-horizon software-engineering tasks we know of, Zenith took GPT-5.5 from 5th place to 1st, ahead of Claude Fable, by building the right harness around the model instead of swapping in a bigger one.
03 That matters now because the strongest models are increasingly hard to reach. Fable 5's access was suspended under a US export-control directive and remains gated; GPT-5.6 launched as a limited preview for a small group of trusted partners. The system around the model is the part that builders can still own and improve.
04 Two ideas make long-running agents work: Zenith keeps the agent planning, testing, and improving across many sessions, while Meta-Zenith continuously learns how to build the right Zenith harness from feedback on real collected tasks.
Harness > Model for frontier performance
When an agent stalls on a hard, long-running task, the reflex is to reach for a stronger model. Zenith takes the other path. The model defines what's possible within a single session; the harness around it determines how sessions are arranged, how state is carried across them, how progress is checked, and when to stop. That is the part Zenith improves. Point Zenith at the task, and it builds a harness shaped to it, then drives the work to completion, testing as it goes. On Frontier SWE, the hardest long-horizon software-engineering benchmark we know of, that approach lifted GPT-5.5 to the top, past Claude Fable.
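To make that division of labor concrete, here is a minimal sketch of a harness loop in Python. It is an illustration under our own assumptions, not Zenith's implementation: `HarnessState`, `run_harness`, `run_session`, and `check_progress` are hypothetical names, and the toy functions at the bottom stand in for a real model session and a real progress check.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HarnessState:
    """State carried from one session to the next (hypothetical shape)."""
    notes: List[str] = field(default_factory=list)
    sessions_run: int = 0
    done: bool = False


def run_harness(
    run_session: Callable[[HarnessState], str],
    check_progress: Callable[[HarnessState], bool],
    max_sessions: int,
) -> HarnessState:
    """Drive sessions until the check passes or the session budget runs out."""
    state = HarnessState()
    while state.sessions_run < max_sessions:
        # The model works inside one session; the harness only keeps
        # the summary that session hands back.
        summary = run_session(state)
        state.notes.append(summary)
        state.sessions_run += 1
        # Progress is checked by the harness, not asserted by the session.
        if check_progress(state):
            state.done = True
            break
    return state


# Toy stand-ins so the sketch runs without a model (assumptions, not real APIs).
def fake_session(state: HarnessState) -> str:
    return f"session {state.sessions_run + 1}: continued from {len(state.notes)} notes"


def fake_check(state: HarnessState) -> bool:
    return state.sessions_run >= 3


final = run_harness(fake_session, fake_check, max_sessions=10)
print(final.sessions_run, final.done)  # prints "3 True"
```

The point of the shape is that the stopping decision and the carried state live outside any single session, so they can be changed without touching the model.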
Zenith is the harness that keeps the agent working, testing, and improving over the long run. Meta-Zenith is how we generate the right Zenith for a new task.
And right now, reaching for a bigger model often isn't an option. Claude Fable 5, the model atop the public Frontier SWE board, had its access suspended under a US export-control directive and is still restricted to government-approved channels. OpenAI's GPT-5.6 Sol launched as a limited preview for a small group of trusted partners. When the strongest models are gated like this, the system around the one you can run is where your leverage is.
Fifth to first on Frontier SWE
Frontier SWE, built by Proximal, is not a normal coding benchmark. Its seventeen tasks are ultra-long-horizon: agents get twenty hours per task, and even so, most barely make progress. Prime Intellect, which hosts the benchmark on its EnvironmentsHub, reports that agents run about eleven hours per task on average and fail to solve almost all of them.
We ran the full suite with Zenith and scored it the same way Proximal scores the public leaderboard. By mean@5, GPT-5.5 with Zenith lands first, at an average rank of 2.06, with 92% dominance. The same GPT-5.5 model on its default Codex harness sits fifth, at 5.53. The benchmark, the trial budget, and the base model stayed the same. The only thing that changed was the control loop around the model.
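For readers unfamiliar with the arithmetic, the sketch below shows one plausible way an average rank can be derived from mean@5 scores. It rests on assumptions: that mean@5 is the mean of five trials per task, that entries are ranked per task with 1 as best, and that tied entries share the average of the positions they occupy. The leaderboard's actual tie-breaking and its dominance metric are not described here, so neither is reproduced. The entry names and scores are made-up toy values, not benchmark results.

```python
from statistics import mean
from typing import Dict, List

# task -> entry -> five trial scores (toy data, purely illustrative)
Trials = Dict[str, Dict[str, List[float]]]


def ranks_for_task(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank entries on one task (1 = best); ties share the average position."""
    ranks = {}
    for entry, score in scores.items():
        better = sum(1 for s in scores.values() if s > score)
        tied = sum(1 for s in scores.values() if s == score)
        # Tied entries occupy positions better+1 .. better+tied.
        ranks[entry] = better + (tied + 1) / 2
    return ranks


def average_rank(trials: Trials) -> Dict[str, float]:
    """Average each entry's per-task rank, using the mean of its trials as its score."""
    per_entry: Dict[str, List[float]] = {}
    for task_trials in trials.values():
        task_scores = {entry: mean(runs) for entry, runs in task_trials.items()}
        for entry, rank in ranks_for_task(task_scores).items():
            per_entry.setdefault(entry, []).append(rank)
    return {entry: mean(ranks) for entry, ranks in per_entry.items()}


toy_trials: Trials = {
    "task_1": {
        "entry_a": [1, 1, 1, 1, 1],
        "entry_b": [1, 0, 1, 0, 0],
        "entry_c": [0, 0, 0, 0, 0],
    },
    "task_2": {
        "entry_a": [1, 0, 0, 0, 0],
        "entry_b": [1, 1, 1, 1, 0],
        "entry_c": [0, 1, 0, 0, 0],
    },
}

result = average_rank(toy_trials)
print(result)
```

In the toy data, `entry_a` and `entry_c` tie on `task_2`, so each receives rank 2.5 there; a lower average rank means an entry placed higher across tasks.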
Everything needed to check it is going in the release: task configs, model and version settings, per-run budgets, run logs, evaluator outputs, and the full results with cost, token, and runtime summaries.
The gap between harnesses is widest on implementation, the longest-horizon work on the benchmark. GPT-5.5 under Codex ranks 7.40 there. With Zenith, it ranks 1.60, ahead of every other entry, including Fable. That is the result we care about most, because Anthropic's own launch notes say the longer and more complex the task, the larger Fable's lead grows.
Implementation is also where the benchmark exposes the failure mode that Zenith exists to fix. Proximal found that on almost every task, models submit early, long before the time limit, not because they give up but out of overconfidence: the tests they write for themselves are superficial enough to let a wrong solution look right. The agent declares victory while the Git test suite still fails to decompress pack files, or while the kernel it just "optimized" is still slower than the reference on the shapes that matter. Independent, evidence-traced testing is the part of the harness that catches this, and it is what separates 7.40 from 1.60.
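One way to picture independent, evidence-traced testing is as a gate between the agent's claim and the final submission. The sketch below is a hypothetical illustration, not Zenith's verifier: `Verdict`, `gate_submission`, and the two toy checks are names we made up, and the checks return canned results in place of running a real reference suite or benchmark.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A check returns (passed, evidence). It is written outside the agent's own tests.
Check = Callable[[], Tuple[bool, str]]


@dataclass
class Verdict:
    accepted: bool
    evidence: List[str]


def gate_submission(agent_claims_done: bool, independent_checks: List[Check]) -> Verdict:
    """Accept a submission only when every independent check passes."""
    if not agent_claims_done:
        return Verdict(False, ["agent has not claimed completion"])
    evidence: List[str] = []
    all_passed = True
    for check in independent_checks:
        passed, detail = check()
        # Record the evidence for every check, passing or not.
        evidence.append(("PASS" if passed else "FAIL") + ": " + detail)
        all_passed = all_passed and passed
    # With no checks there is no evidence, so the claim is not accepted.
    return Verdict(bool(independent_checks) and all_passed, evidence)


# Toy checks with canned outcomes (assumptions for illustration only).
def pack_files_check() -> Tuple[bool, str]:
    return (False, "reference suite could not decompress pack files")


def kernel_speed_check() -> Tuple[bool, str]:
    return (True, "kernel faster than reference on target shapes")


verdict = gate_submission(True, [pack_files_check, kernel_speed_check])
print(verdict.accepted)  # prints "False"
for line in verdict.evidence:
    print(line)
```

The design choice worth noting is that the agent's confidence never enters the decision: a claim of completion only triggers the checks, and the recorded evidence decides the outcome.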
From a sketch to a harness, and continuous improvement
The first Zenith harness was built by hand. We manually wrote the system prompts, default skills, worker definitions, validation rules, and orchestration logic. To get there, we studied five harness designs across eight long-horizon tasks and isolated the control mechanisms that consistently mattered: repeated gap-finding, revisable planning, independent verification, adaptive orchestration, and disciplined stopping. We then combined these mechanisms into a single adaptive system that achieved the best mean rank while...