Two AI judges scored our agent's answer 0.85, but it never opened the file

LLM-as-judge became the default for agent evaluation (and it can't see the failure that matters)

A judge model can only see the final answer. It cannot see whether the agent was allowed to know what it claimed, when it could have known it, or whether an absence claim was ever actually checked. The number on the leaderboard is silent on all three.

Tenure research · ~9 min read

TL;DR

LLM-as-judge became the default for agent evaluation because it is the only general-purpose tool available for open-ended tasks. That does not make it the right tool for every failure class. A judge model compares a final answer to a correct answer. It has no visibility into the path the agent took to produce that answer, and no way to check whether the agent was permitted to use the evidence it used.

In a case study, two frontier judges scored an agent response 0.85. The agent had never opened the document its answer depended on. It asserted the document didn't exist and answered anyway. GroundEval scored it 0.000; the full case study and code are in the repo.

This is a distinct failure class from tool-use mechanics. Trajectory-aware benchmarks already check whether an agent called the right tools in the right order. None of them check whether the agent was allowed to know what it claimed, or whether an absence claim was earned by sufficient search.

The fix is a deterministic state contract: an access policy, an event log, and artifact timestamps, checked against the trajectory without a judge in the loop.
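To make the three parts of that contract concrete, here is a minimal sketch in Python. It is not GroundEval's implementation; the `Event` and `Contract` shapes, the field names, and the violation labels are all assumptions made for illustration. The point is only that each check is a plain lookup against recorded state, so the result is the same every time it runs.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Event:
    """One recorded step in the agent's trajectory (hypothetical schema)."""
    t: float          # when the agent acted, in seconds since the run started
    action: str       # e.g. "fetch" or "search"
    artifact_id: str  # which document the action touched


@dataclass
class Contract:
    """Access policy plus artifact timestamps (hypothetical shapes)."""
    allowed: Set[str]             # artifacts the agent is permitted to read
    created_at: Dict[str, float]  # when each artifact came into existence


def check_claim(artifact_id: str, claim_time: float,
                log: List[Event], contract: Contract) -> List[str]:
    """Return the contract violations for a claim that relies on artifact_id."""
    violations = []
    # Access policy: was the agent allowed to know this at all?
    if artifact_id not in contract.allowed:
        violations.append("not_permitted")
    # Timestamps: did the artifact exist by the time of the claim?
    # An artifact with no recorded timestamp is treated as unavailable.
    if contract.created_at.get(artifact_id, float("inf")) > claim_time:
        violations.append("not_yet_available")
    # Event log: did the agent actually open it before making the claim?
    opened = any(
        e.action == "fetch" and e.artifact_id == artifact_id and e.t <= claim_time
        for e in log
    )
    if not opened:
        violations.append("never_opened")
    return violations
```

An empty list means the claim is consistent with the recorded state. A permitted, existing page with an empty event log returns `["never_opened"]`, which is the situation the case study describes.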

How we got here

The framing that locked in the wrong question

When an agent benchmark needs to score something with no single correct string (a free-text explanation, a multi-step research task, a tool-using trajectory that could reasonably end a few different ways), the standard move is to hand the question, the response, and a reference answer to a judge model and ask it to grade. Humanity's Last Exam popularized a specific version of this: extract the final answer, compare it to a known correct answer, output yes or no. It is fast, it is general-purpose, and it requires no domain-specific scaffolding. That is exactly why it spread.

The judge prompt itself is explicit about what it is checking. It instructs the model to focus only on whether the extracted final answer matches the correct answer: no commenting on background, no arguing for a different answer, no solving the problem itself. That scope is deliberate and reasonable for what it is built to do: confirm a string match with tolerance for phrasing and numerical variance. It was never built to check anything about how the answer was produced.

Nobody asked whether final-answer matching was sufficient for agents that act in the world rather than just answer questions. The infrastructure was already there, the prompt template was already written, and grading an agent's output looked like grading a model's output. It isn't the same thing. The difference is the entire problem.

A judge model reading a final answer is checking whether the destination matches the map. It has no way to check whether the agent actually walked the route, or teleported there on a lucky guess. Both produce the same string. Only one of them is trustworthy the next time the terrain changes.

The case study

0.85 from two judges. 0.000 from the trace.

An agent was asked a question whose answer depended on a specific Confluence page. The agent responded as though it had checked. It described, in plausible and confident language, why the page in question did not exist and answered the question on that basis.

Two separate frontier judge models read the question and the response and scored it 0.85. Both judges found the answer well-reasoned and the explanation coherent. Neither judge had access to anything other than the question and the final response, which is the entire design of the grading prompt: extract the final answer, compare it to ground truth, ignore everything else.

The trace told a different story. The agent never fetched the page. It never opened it, never searched for it, never issued a query that would have surfaced it. It asserted absence without having done the work that an absence claim requires, and then reasoned forward from that unverified assertion as though it were fact. Scored against the recorded trajectory and the access policy that governed it, the response receives 0.000.
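The absence check described above can be sketched as a gate on the answer score. This is an illustration, not the scoring rule used in the case study: the dictionary event format and the `covers` field (the set of artifact ids a lookup could have surfaced, assumed here to be recorded alongside each event) are hypothetical.

```python
from typing import Dict, List


def absence_claim_earned(target: str, log: List[Dict]) -> bool:
    """True if at least one recorded lookup could have surfaced `target`."""
    for event in log:
        # `covers` is a hypothetical field: the artifact ids this fetch or
        # search could have returned, recorded at the time of the call.
        if event["action"] in ("fetch", "search") and target in event["covers"]:
            return True
    return False


def gated_score(answer_score: float, target: str, log: List[Dict]) -> float:
    """Keep the answer score only if the absence claim was actually checked."""
    return answer_score if absence_claim_earned(target, log) else 0.0
```

With an empty log, or a log whose lookups never covered the page, `gated_score(0.85, "page-1", log)` returns `0.0` no matter how coherent the final answer reads.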

Same response, two scoring methods

Scoring method | What it checks | Score
Judge model A | Does the final answer match the expected answer | 0.85
...
