How good a detective is an AI? A Sherlock Holmes board game as an LLM-agent eval


It started at a dinner. A few friends and I sat down to play Sherlock Holmes Consulting Detective — an open-ended deduction game where you’re handed a Victorian London case, you pick which people and places to go investigate, and each lead hands you a passage of text to read. Most of the game is reading, cross-referencing, and arguing at the table. At the end you answer the case’s questions and score yourself against Holmes himself — including how few leads you needed. The answers sit in the back of the booklet, printed upside-down, daring you not to peek.

We walked straight into the trap the case is built around. There’s an obvious victim — a man every detail points to as the target — and we hung our whole theory on him. But one clue wouldn’t sit still. The morning after the murder, the killer goes back to a shipping office and scans the passenger list again. We re-read the passage three times. Why would he do that? If he’d already killed the person he was after, what was he still looking for? Something didn’t close, and none of us could say what.

So, at 2am — out of wine and out of steam — we did the forbidden thing: we turned the booklet over. And there in the answer key, a name we’d treated as background all evening stepped forward as the real undercover agent — alive, never caught, the person the killer was still hunting. The passenger-list visit wasn’t a loose end. It was the case. We’d held the contradiction in our hands — we’d even said out loud that it was strange — and we’d read right past it.

That non-closing feeling is the thing that stuck. We weren’t short on information; we had every clue we needed. We were short one inference — the small, second-order turn from “that’s a strange thing for the killer to do” to “then the whole story we’ve built is wrong.” So I started to wonder: how good a detective is an AI, really? Handed the same leads, would an LLM agent read that behavior as a behavior, notice it broke the obvious story, and follow it to the live agent we’d missed?

To find out, I turned the game into an eval for LLM agents. The agent plays the Irregulars — the Baker Street street kids Holmes sends out to do his legwork.

On its first run, Claude Fable 5 tied Holmes — in the hard mode, where you don’t even get to see the questions until the investigation is over.

That’s the headline. But the score isn’t the story. The interesting part is the two distinct ways these agents fail — and that the harder failure, the exact one that beat us at dinner, has a clean fix that turned out to be less about model size than I expected.

Why a board game is a surprisingly honest agent eval

What I didn’t see at the table that night is that we’d just lost to an unusually clean agent benchmark. Many agent benchmarks suffer from at least one of three problems: the answer is somewhere in the context, or the task is gameable, or “success” is graded loosely. A printed detective game sidesteps all three by construction:

The solution is physically hidden. Those upside-down answers never enter the agent’s allowed workspace; reading them would be a detectable protocol violation, and I audit for it.

Information has a price. Thinking, re-reading, and cross-referencing are free and unlimited. But acting, meaning visiting a location to pull a new clue, is the only way to get new information, and every new clue beyond what Holmes used costs points. That’s a miniature of real agent economics: every tool call costs something.

It rewards comprehension, not retrieval. Clues are behaviors and details you have to assemble into one coherent story; none of them hands you the answer.
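To make the "information has a price" rule concrete, here is a minimal sketch of how such a score could be computed. The function name, the 5-point penalty per extra lead, and the choice to give no bonus for using fewer leads than Holmes are all assumptions for illustration; the eval's actual scoring lives in the repo.

```python
def final_score(answer_points: int, leads_followed: int, holmes_leads: int,
                penalty_per_extra_lead: int = 5) -> int:
    """Hypothetical scoring rule: points earned for answers, minus a
    fixed penalty for every lead followed beyond Holmes's count."""
    # Leads at or below Holmes's count are free (assumed: no bonus either).
    extra_leads = max(0, leads_followed - holmes_leads)
    return answer_points - penalty_per_extra_lead * extra_leads


# Matching Holmes's lead count keeps the full answer score.
tied = final_score(answer_points=100, leads_followed=4, holmes_leads=4)

# Three extra visits cost 3 * 5 = 15 points under the assumed penalty.
over = final_score(answer_points=100, leads_followed=7, holmes_leads=4)
```

Under this rule, re-reading a clue the agent already holds changes nothing, while each additional visit has to buy enough answer points to pay for itself.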

The mechanics that make this auditable, in one breath: the agent works in a sandbox containing only what it’s allowed to see; a deterministic Game Master (plain Python, not an LLM) serves clues verbatim and logs everything; visits cost points and the solution lives outside the agent’s reach; and a separate validator — the only component that reads the solution — cross-checks the log against the answers afterward. (More on the isolation in How it’s built below; full mechanics in the repo.)
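The split between a deterministic clue server and an after-the-fact validator could look roughly like the sketch below. Everything here is hypothetical: the class and function names, the fallback passage, the exact-match grading, and the decision to count each distinct location once are assumptions for illustration, not the repo's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class GameMaster:
    """Deterministic clue server: no LLM, no randomness (hypothetical sketch)."""
    clues: dict[str, str]                        # location -> passage text
    log: list[str] = field(default_factory=list)

    def visit(self, location: str) -> str:
        # Log every request, including misses, so the run can be audited later.
        self.log.append(location)
        # Serve the stored passage verbatim; the fallback text is an assumption.
        return self.clues.get(location, "No lead at this address.")


def validate(log: list[str], answers: dict[str, str],
             solution: dict[str, str]) -> dict[str, int]:
    """Runs after the game; the only function that ever receives `solution`."""
    correct = sum(
        1 for question, expected in solution.items()
        if answers.get(question, "").strip().lower() == expected.lower()
    )
    # Assumed: repeat visits to the same location count as one lead.
    return {"correct": correct, "leads_used": len(set(log))}


gm = GameMaster(clues={"shipping office": "placeholder passage text"})
passage = gm.visit("shipping office")
gm.visit("docks")  # a miss is still logged
report = validate(gm.log, {"q1": "Alice "}, {"q1": "alice", "q2": "bob"})
```

The point of the shape is that `GameMaster` never sees the solution and `validate` never runs while the agent is playing, so a leak would have to cross a boundary that the log makes visible.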

A note on words: I’ll call it cheat-resistant, not cheat-proof. It’s a commercial game, so I can’t rule out that some of the case leaked into pretraining, or that an agent could steer its exploration with latent knowledge it never names in an answer. What I can show is that the agents’ mistakes are consistent with only the information they were served: strong evidence, not proof.

The two ways it fails

Across a ladder of models (Claude Haiku 4.5 → Claude Sonnet 4.6 → Claude Opus 4.8 → Claude Fable 5), two failure modes show up again and again. They’re worth naming because they’re not specific to board games — they’re how LLM agents fail at any multi-step retrieval-and-reasoning task.

Failure 1 — Execution: preferring what you generated to what you retrieved

The...
