AI: The Thirty Percent Confession

The Thirty Percent Confession | I, Cringely

Last time I told you the AI industry is paying a tax it doesn’t have to pay — that a great deal of what we grandly call “AI” is really just looking things up, and we’ve chosen to do that looking-up on the most expensive silicon ever manufactured. A number of you wrote to say I was overstating it. Surely, you said, the people setting hundreds of billions of dollars on fire know something I don’t.

So this week I won’t argue with you. I’ll let one of the largest companies in enterprise software argue with you instead — because it already has, in a research paper it published itself and seems to have hoped you wouldn’t read too closely.

The company is Salesforce. The same Salesforce selling you “agents,” an “agentic enterprise,” a tireless digital workforce to set beside your human one. While one part of the building handled the marketing, another part — Salesforce AI Research, the people whose job is to measure things rather than sell them — built a test to find out how well today’s best AI can do something gloriously unglamorous: find the right piece of information when it’s scattered across the mess of a normal company. Slack threads. GitHub. Meeting transcripts. Documents nobody filed correctly. The stuff every real business actually runs on.

They named it HERB — the Heterogeneous Enterprise RAG Benchmark — and they didn’t build it on the cheap. It’s a synthetic but painstakingly realistic company: 530 employees across 30 products, generating 39,190 documents, messages, transcripts, and pull requests, strewn about the way they really would be. The paper is on arXiv. The data is on Hugging Face. Anyone can check my arithmetic, which is exactly why I’m happy to build a column on it.

Now, the number.

When Salesforce turned the best agentic retrieval systems money can buy loose on HERB (top-tier models, the good stuff, with planning and tool use), they scored 32.96 out of 100. (Thirty-three, to the nearest whole number; I rounded down to thirty for the headline.)

A third. On a test of finding information that is definitely, provably somewhere in the building. Two times out of three, the most advanced AI on the market went hunting for an answer that existed and came back with the wrong one — or with confident nonsense.

Sit with that, because two floors up the marketing department is selling you an autonomous digital employee, and the research department just published evidence that the digital employee finds the right file about a third of the time.

But the score isn’t the part that should keep you up at night. Two findings underneath it are.

The first is the diagnosis Salesforce’s own researchers wrote down: the bottleneck isn’t the thinking, it’s the finding. The models could reason fine — they simply couldn’t retrieve the right material to reason over. The proof is brutal in its simplicity. When the researchers stopped making the system hunt and instead handed the model the company’s documents outright, the best one leapt from that miserable third to 76.55. Same model. Same questions. The only thing that changed was whether it had to find the evidence or was handed it.
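For anyone who wants to see the size of that jump on paper, here is a minimal Python sketch of the arithmetic. The two scores are the ones quoted above; the variable names are mine, not the paper's, and this is a back-of-envelope check rather than anything taken from Salesforce's code.

```python
# Scores quoted in the column, out of 100. Variable names are illustrative only.
agentic_retrieval_score = 32.96   # the system has to find the evidence itself
evidence_provided_score = 76.55   # same model, handed the documents outright

# Absolute gap, in points, between the two settings.
gap_points = evidence_provided_score - agentic_retrieval_score

# How many times higher the score is once retrieval is taken out of the loop.
improvement_ratio = evidence_provided_score / agentic_retrieval_score

print(f"Gap: {gap_points:.2f} points")
print(f"Ratio: {improvement_ratio:.2f}x")
```

That works out to a gap of roughly 44 points, or a score a bit more than twice as high, from changing nothing but who does the finding.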

Read that twice, because it’s the most important sentence published in enterprise AI this year and almost nobody noticed: the model was never the problem. The expensive part — the giant, GPU-devouring brain everyone is mortgaging the next decade to buy more of — is sitting there perfectly capable, tapping its foot, waiting for the cheap, dull, unglamorous retrieval layer to bring it the right paragraph. And the retrieval layer can’t.

This is the whole ballgame, and it lands exactly where I left you last time. I claimed two-thirds of enterprise AI is really retrieval wearing intelligence as a costume. Here is Salesforce — not a friendly witness, but a company whose entire pitch depends on the opposite being true — confirming that retrieval is precisely where the enterprise falls apart, and that a bigger, smarter, hungrier model does not rescue you, because the model was already good enough.

The second finding is the one I find most damning, and it’s hiding in the dataset’s own structure. Of HERB’s 1,514 questions, only 815 have answers. The other 699 — nearly half — are unanswerable by design. Salesforce deliberately wrote hundreds of perfectly reasonable-sounding questions for which no supporting evidence exists anywhere in the simulated company, and then watched to see whether the AI would admit it didn’t know.
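The split is easy to check for yourself. A minimal Python sketch, using only the three counts quoted above (the variable names are my own, not fields from the dataset):

```python
# Question counts quoted in the column. Names are illustrative, not dataset fields.
total_questions = 1514
answerable = 815
unanswerable = 699

# The two groups should account for every question.
assert answerable + unanswerable == total_questions

# Fraction of the benchmark that has no supporting evidence by design.
unanswerable_share = unanswerable / total_questions

print(f"Unanswerable share: {unanswerable_share:.1%}")
```

The counts add up, and the unanswerable share comes to a little over 46 percent, which is what "nearly half" is standing in for.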

Think about what that means. HERB isn’t only a test of whether a system can find the answer. Nearly half of it is a test of whether the system knows when there isn’t one — whether, handed a plausible question and no facts to support it, it has the spine to say “I can’t find that” instead of manufacturing something that sounds right. That is the single most important behavior an enterprise needs from AI, and the one almost no system on the market...
