Building a Deep Research Agent That Survives Its Own Failures

Jun 11, 2026

San Francisco

Nikola Balic

I have been fascinated by deep research agents for a while now, studying every shape I stumble on. In part one I took apart the deep-research harness inside Claude Code. This is what happened when I built my own and pointed real evals at it. Then rebuilt it.

TLDR: Durable Researcher is a browser-native deep research agent that checkpoints every model step, rebuilds state from the transcript, routes tasks by answer shape, and uses eval failures as the product loop.

The point was to make every failure leave enough state behind that I could resume it, inspect it, and turn it into the next change. Make a better agent.

The first version was good at the wrong thing. It wrote beautiful overviews. It planned sub-queries, browsed in parallel, took notes, checked its coverage, and produced a polished report. Ask it to survey a field and it shone. Ask it for one number, like a cash-flow figure from a filing, and it handed you a thoughtful essay about the company instead.

So I added a citation verifier. It checked every claim in the report against the agent's notes and triggered a rewrite when the evidence was thin. Clean. Obviously the kind of thing a research agent should have. The evals said it made things worse. I switched it off.

Build what seems right. Test it on real tasks. Read the failures. And when the data turns on you, quarantine your cleverness until you understand it.

The Durable Research Loop

Durable Researcher takes a topic, plans sub-queries, runs them in parallel against real browser sessions on Steel, takes structured notes, checks its own coverage, fills the gaps, and writes a report.

Steel matters because the agent needs browsers, not HTTP fetches. The useful research lives on pages that render late, redirect, block scrapers, or hide their content behind behavior. A plain fetch sees none of it.

Every model message is checkpointed to Postgres as it happens. If the agent dies, the next run resumes from the last checkpoint. The model never knows it crashed.

That is only the first layer. Notes, visited URLs, claims, and source ledgers are rebuilt from the transcript. Scraped pages are cached in Postgres, keyed by task and URL, so a crash does not mean paying Steel to browse the same page again. Verification attempts are checkpointed too, which means a bad rewrite leaves a trail: claim verdicts, pass rate, unsupported lines, and reasons.

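To make the checkpoint-and-resume idea concrete, here is a minimal sketch. It is not the actual Durable Researcher code: `CheckpointStore`, `ModelStep`, and `runTask` are hypothetical names, the store is an in-memory map standing in for a Postgres table, and the model step is synchronous to keep the example short. A real version would be async and would sit on the durable engine.

```typescript
// Sketch only: an in-memory stand-in for a Postgres-backed checkpoint table.
type Message = { role: "user" | "assistant" | "tool"; content: string };

class CheckpointStore {
  private logs = new Map<string, Message[]>();

  // Persist one message for a task as soon as it is produced.
  append(taskId: string, message: Message): void {
    const log = this.logs.get(taskId) ?? [];
    log.push(message);
    this.logs.set(taskId, log);
  }

  // Return a copy so callers cannot mutate the stored transcript.
  load(taskId: string): Message[] {
    return (this.logs.get(taskId) ?? []).slice();
  }
}

// Hypothetical model step: sees the transcript so far and returns the next
// message, or null when the run is finished.
type ModelStep = (transcript: readonly Message[]) => Message | null;

function runTask(
  store: CheckpointStore,
  taskId: string,
  topic: string,
  step: ModelStep,
  maxSteps = 50,
): Message[] {
  // Resume from whatever was checkpointed; a fresh task starts with the topic.
  const transcript = store.load(taskId);
  if (transcript.length === 0) {
    const first: Message = { role: "user", content: topic };
    store.append(taskId, first);
    transcript.push(first);
  }
  for (let i = 0; i < maxSteps; i++) {
    const next = step(transcript);
    if (next === null) break;
    store.append(taskId, next); // checkpoint before taking another step
    transcript.push(next);
  }
  return transcript;
}
```

If `step` throws partway through, calling `runTask` again with the same task ID picks up from the last stored message, and the step function only ever sees a transcript that looks uninterrupted.
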
The stack: Bun and TypeScript, Pi for the agent loop, Absurd for durable execution, Steel for browsers, Postgres for persistence, GLM-5.1 (cheap, fast, and smart) for reasoning, and Ink for the terminal UI.

None of it is exotic. The product lives in the fit between the pieces.

Campaign mode pushes the same idea further. Long research runs are split into bounded pulses, each with its own task ID, objective, report, judge decision, usage, and source inventory. That avoids one giant conversation while preserving the user-visible behavior of one continuous run.

Part one found Claude Code's harness stuck in a single-pass pattern: scope once, search once, never let a finding change the next query. Mine had the same limitation until I taught it to take the second hop, which is most of what follows.

The Numbers, With the Caveats Up Front

There are two useful academic benchmarks for this kind of system. ResearchRubrics, from Scale AI, has 101 tasks with weighted criteria judged by an LLM. DRACO, from Perplexity, has 100 tasks with a similar shape.

Benchmark         Judge            Tasks    Score   How to read it
ResearchRubrics   Gemini default   10/101   59.8%   Promising, partial
DRACO             Gemini 3.1 Pro   10/100   47.1%   Paper-comparable, weak sample

For reference, the published full-set ResearchRubrics scores are Gemini Deep Research at 61.5%, OpenAI Deep Research at 59.7%, and Perplexity Deep Research at 48.7%.

The judge belongs in the headline of the metric. LLM-as-judge numbers are useful, but only if every chart clearly says which judge produced them.

So the honest claim is narrow: promising on ResearchRubrics, lower-middle on the Gemini-judged DRACO sample. The full 100-task DRACO run still matters most, but to keep the loop fast I opted for a random subsample of 10 tasks, roughly 10% of each benchmark.

Sprint One: Make Failure Resumable

The first sprint built the substrate. Every assistant message becomes a step in the durable engine. On resume, the engine replays the completed steps and hands the conversation back to the model.

There is no notes table. No visited-URL table. Tool calls and their results live in the message log, and the app rebuilds notes and URLs by walking it. The transcript is the canonical artifact. If the log...
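
The transcript walk from Sprint One, where notes and visited URLs are derived from tool calls in the message log instead of being stored in their own tables, might look roughly like the sketch below. The log entry shape and the tool names `browse` and `take_note` are assumptions made for illustration; the article does not show the real schema.

```typescript
// Sketch only: a hypothetical shape for one entry in the message log.
type ToolCall = { name: string; args: Record<string, string> };
type LogEntry = {
  role: "user" | "assistant" | "tool";
  content: string;
  toolCalls?: ToolCall[];
};

type DerivedState = { notes: string[]; visitedUrls: string[] };

// Walk the transcript once and rebuild everything that would otherwise
// need its own table. The log stays the single source of truth.
function rebuildState(log: readonly LogEntry[]): DerivedState {
  const notes: string[] = [];
  const visited = new Set<string>(); // a Set deduplicates repeat visits
  for (const entry of log) {
    for (const call of entry.toolCalls ?? []) {
      const url = call.args["url"];
      const text = call.args["text"];
      if (call.name === "browse" && url) visited.add(url);
      if (call.name === "take_note" && text) notes.push(text);
    }
  }
  return { notes, visitedUrls: Array.from(visited) };
}
```

Because the state is a pure function of the log, resuming after a crash needs no separate recovery path for notes or URLs: load the transcript, call `rebuildState`, and continue.
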
