Does anyone in your organisation own "correctness" in your AI products?


Nobody in Your Organization Owns 'Correct'

Alokit's Substack


Alokit, May 22, 2026


Harvey AI is one of the best-resourced legal AI companies in the world. Hundreds of millions in funding. Elite law firm customers. A team that has been building in this space longer and harder than almost anyone.

In April 2026, they published a post-mortem on their document review algorithm.

The algorithm was working. It was being used by lawyers at major firms. It was passing tests. And they rebuilt it anyway, because it was insufficiently verifiable.

The original system produced citations attached to whole cells, not individual statements. Harvey, as they put it, "didn't reveal its line of thinking." Lawyers are paid to exercise diligence: to check sources, to understand reasoning, to stake their professional reputations on what they submit. An algorithm that delivered conclusions without per-statement reasoning made that impossible.

The rebuild produced answers with per-statement citations and exposed reasoning at each step. Not because the original was broken. Because verification requires that you can see the work.

That's the insight worth taking seriously: insufficiently verifiable is a failure mode distinct from incorrect. Most organizations haven't built their verification infrastructure for the first problem because they're focused entirely on the second.
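To make the difference between cell-level and statement-level citations concrete, here is a minimal sketch of what "per-statement" could look like as a data structure. This is not Harvey's implementation; the `Citation` and `Statement` classes, their field names, and the `unverifiable_statements` helper are all hypothetical, invented only to illustrate the idea that each claim carries its own sources and reasoning.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    source_id: str  # hypothetical document identifier
    quote: str      # the exact passage the statement relies on


@dataclass
class Statement:
    text: str
    reasoning: str = ""
    citations: list[Citation] = field(default_factory=list)


def unverifiable_statements(answer: list[Statement]) -> list[Statement]:
    """Return statements a reviewer cannot check: no citation or no stated reasoning."""
    return [s for s in answer if not s.citations or not s.reasoning.strip()]


# Toy answer: the first statement shows its work, the second does not.
answer = [
    Statement(
        text="The contract renews automatically.",
        reasoning="Clause 12 sets a renewal term.",
        citations=[Citation("doc-1", "shall renew for successive one-year terms")],
    ),
    Statement(text="Either party may terminate on 30 days' notice."),
]

flagged = unverifiable_statements(answer)
for statement in flagged:
    print("Cannot verify:", statement.text)
```

In this toy structure, both statements might be correct. Only the first can be checked by a reviewer, which is the distinction between "incorrect" and "insufficiently verifiable".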

The Score That Looked Good While Being Useless

Anthropic ran an AI safety evaluation using a benchmark called BBQ, designed to measure whether their models exhibited harmful social biases. The results came back encouraging. Bias scores close to zero. Progress appeared.

One of the benchmark's original developers asked a simple follow-up question: were the models actually answering the questions?

They weren't. The models were deflecting (refusing or giving non-answers), which registered as technically unbiased because a non-answer can't be biased in any direction. The safety score looked good. The underlying behavior was useless.

Anthropic named this the core failure mode of automated evaluation: "All evaluations are subject to the failure mode where you overinterpret the quantitative score and delude yourself into thinking that you have made progress when you haven't."

This isn't a niche problem. Scale AI published HiL-Bench in April 2026, a benchmark specifically designed to test whether agents know when they're missing context. Results: frontier agents solved up to 89% of software engineering tasks when given full context. Remove a few realistic details (an ambiguous requirement, a buried constraint, an unresolved contradiction) and the best model's completion rate dropped to 24%. More troubling: the agents didn't ask for help. They guessed confidently and shipped wrong outputs.

An 89% benchmark score describing a system that would fail 76% of the time under realistic conditions. The score didn't lie; it measured something other than what mattered.
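A toy calculation shows how a deflecting model can post the same score as one that actually engages. The metric below is a simplified stand-in, not BBQ's actual scoring formula; the label names and the `summarize` function are assumptions made for illustration. The point is only that a second number, the answer rate, separates the two cases.

```python
from collections import Counter


def summarize(labels: list[str]) -> dict[str, float]:
    """Score toy labels: "biased", "counter_biased", or "refused", one per question."""
    total = len(labels)
    if total == 0:
        raise ValueError("no responses to score")
    counts = Counter(labels)
    answered = counts["biased"] + counts["counter_biased"]
    return {
        # Net lean toward the biased answer, as a fraction of all questions.
        "bias_score": (counts["biased"] - counts["counter_biased"]) / total,
        # Fraction of questions the model actually answered.
        "answer_rate": answered / total,
    }


# A model that refuses almost everything.
deflecting = summarize(["refused"] * 98 + ["biased", "counter_biased"])

# A model that answers nearly everything, with no net lean.
engaged = summarize(["biased"] * 48 + ["counter_biased"] * 48 + ["refused"] * 4)

print(deflecting)
print(engaged)
```

Both models score zero on bias in this sketch. Reported alone, the score cannot tell them apart; reported next to the answer rate, the deflection is obvious.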

The Self-Referential Problem

The natural response to unreliable benchmarks is to evaluate dynamically: use AI to evaluate AI outputs in real time. LLM-as-judge. A second model assessing the first.

In April 2026, a thread in r/LLMDevs titled "LLM-as-judge is not a verification layer. It is a second failure mode" gathered 62 comments from practitioners actively shipping AI in production.

The most-upvoted reply: "Two probabilistic systems do not add up to a deterministic one. You have not added a verification layer."

Anthropic named this the "ouroboros" problem. When AI systems generate evaluations for other AI systems, those evaluations inherit the same tendencies (the same biases, the same hallucination patterns, the same blind spots) as the systems being evaluated. The evaluator eats its own tail.

LLM-as-judge isn't useless. It's valuable for semantic evaluation at volumes no human review process could cover. But practitioners who get the most from it have learned to split: compliance checks (schema validation, scope boundaries, authorization rules) stay deterministic. The LLM judge handles only what genuinely needs contextual judgment. Most teams default to LLM-for-everything because it's the easiest reach. That's where cost and reliability problems compound.
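The split described above might be sketched like this: rule-based checks run first and can reject an action outright, and the judge is called only for what survives them. Everything here is hypothetical, including the `AgentAction` shape, the allow-lists, and the `judge` callable, which stands in for whatever LLM call a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentAction:
    payload: dict
    destination: str   # where the output will be sent
    requested_by: str  # who asked for the action


@dataclass
class Verdict:
    passed: bool
    stage: str   # "deterministic" or "judge"
    reason: str


# Hypothetical rules; a real system would load these from policy or config.
REQUIRED_FIELDS = {"recipient", "body"}
ALLOWED_DESTINATIONS = {"dm:manager", "dm:report"}
AUTHORIZED_USERS = {"manager-17"}


def deterministic_checks(action: AgentAction) -> Optional[str]:
    """Return a failure reason, or None if every rule-based check passes."""
    missing = REQUIRED_FIELDS - set(action.payload)
    if missing:
        return f"schema: missing fields {sorted(missing)}"
    if action.destination not in ALLOWED_DESTINATIONS:
        return f"scope: destination {action.destination!r} not allowed"
    if action.requested_by not in AUTHORIZED_USERS:
        return f"authorization: {action.requested_by!r} not permitted"
    return None


def verify(action: AgentAction, judge: Callable[[str], bool]) -> Verdict:
    failure = deterministic_checks(action)
    if failure is not None:
        # Rule violations never reach the probabilistic judge.
        return Verdict(False, "deterministic", failure)
    approved = judge(action.payload["body"])
    return Verdict(approved, "judge", "judge approved" if approved else "judge rejected")


judge_calls: list[str] = []


def stub_judge(text: str) -> bool:
    """Stand-in for an LLM call; records what it was asked to assess."""
    judge_calls.append(text)
    return True


leak = AgentAction({"recipient": "all", "body": "feedback"}, "channel:org-wide", "manager-17")
fine = AgentAction({"recipient": "r1", "body": "feedback"}, "dm:report", "manager-17")

blocked = verify(leak, stub_judge)
allowed = verify(fine, stub_judge)
print(blocked)
print(allowed)
```

In this sketch the out-of-scope destination is rejected by a set lookup, and the judge is invoked once, only for the action that passed every rule. The design choice is that anything expressible as a rule gets a deterministic answer, and the judge's cost and variance are spent only on semantic questions.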

The Right Outputs in the Wrong Place

There's a failure mode more dangerous than hallucination that most verification frameworks aren't built for: correct content, wrong scope.

A performance coaching agent is given a task: draft feedback notes for a manager's direct reports and share them. The agent drafts thoughtful, accurate notes. It shares them to an org-wide Slack channel. Private performance feedback surfaces to 400 people.

The agent didn't hallucinate. The content was reasonable. The scope was catastrophic.

Standard verification asks: did the model hallucinate? Here the answer is no. But the output was still wrong, in a way that neither the model nor the retrieval system...
