We gave our agent the exact metric definition. It still wrote the wrong SQL

Anthropic and OpenAI both said context is the bottleneck for data agents. Here's what they didn't say. | ClariLayer Blog AI AgentsMetric Governance Anthropic and OpenAI both said context is the bottleneck for data agents. Here's what they didn't say. Kyle Hui·June 9, 2026 Anthropic and OpenAI recently published unusually candid accounts of how they make data agents accurate on their own internal data. Anthropic's "How Anthropic enables self-service data analytics with Claude" measures what moved their eval numbers. OpenAI's "Inside our in-house data agent" describes the context architecture behind an agent serving more than 3,500 internal users across 600+ petabytes. Different angles, same conclusion: the bottleneck is not SQL generation. It is context — how it is structured, routed, and maintained as the warehouse drifts underneath it. We build a small product on exactly this premise, and on their platforms: it lives inside the agents analysts already use — Claude Code, Cursor, and Codex — connected over MCP. So this is not a rebuttal; both posts are right. But they share an assumption that leaves most working analysts out of the story, and once you build past the failure modes they describe, you meet new ones neither post covers. These are our field notes. What both posts establish Two of Anthropic's results deserve quoting at anyone who says "just give the agent more context." First, procedural knowledge dominated: without skills — files encoding how to work, which sources to consult, in what order — accuracy "didn't exceed 21%" on their evals; with them, "consistently above 95%." Second, the anti-RAG ablation: they gave the agent grep access to their entire SQL corpus — thousands of dashboard, transformation, and notebook files — and accuracy moved by less than a point, even though the right answer was present roughly 80% of the time. Their summary: "the information was there, the agent saw it, and it still didn't use it." The bottleneck was not access; it was structure — mapping a question to the right entity. They also put a number on drift: ~95% at launch fell to ~65% over a month, until maintenance became an engineering practice. OpenAI supplies the architecture view: six layers of context — usage patterns, human annotations, code-derived enrichment, institutional knowledge, memory, runtime lookups — including a nightly job that reads pipeline code, on their lesson that meaning lives in code, and a memory that retains the non-obvious corrections, filters, and constraints that decide correctness. Two of their lessons generalize far beyond their scale: fewer, stronger tools beat a full toolbox — exposing the full tool set made the agent less reliable — and guiding the goal beats prescribing the path — highly prescriptive prompting degraded results. All of it built by roughly two engineers — this is now buildable. The person both posts assume away Look at what the two architectures presuppose: canonical datasets. A semantic layer as the first source of truth. Owners for every domain. A data team to keep skill files fresh and wire review hooks into CI. Even Anthropic's start-from-zero advice — "a handful of canonical datasets, a few dozen offline evals, and a thin knowledge skill" — is a data-team project. These are blueprints for organizations that already have a data organization. Most analysts do not work at one. The person we build for is the individual analyst — RevOps, analytics engineering, the one data person at a company that never had two — who lives in Claude Code, Cursor, or Codex and works alone against a messy, ungoverned warehouse: no semantic layer, no canonical datasets, nobody to file a ticket with, and a recurring moment that eats afternoons: why don't these two numbers match? ClariLayer is an individual-analyst context layer (MCP-delivered): the thesis both labs validated, productized for one person with no platform team behind them. The difference changes the engineering: at lab scale, process fixes context problems — review hooks, owners, marketplaces. Alone, the layer itself has to do that work. The failure we ran into A framing note, because we would rather under-claim: these are single paired runs on a synthetic warehouse with a real coding agent — built to find failure modes, not to measure rates. First we ran into the failure class behind Anthropic's ablation, at the opposite end of the scale. We gave an agent full warehouse access and served it context including the exact checked metric contract for the question — a definition reconciled against the warehouse, grain and filters spelled out. It wrote wrong SQL anyway. It followed a stale prose note served right beside it: it added bonus_eligible_flag = true, a filter the contract does not list, and wrapped the time column in DATE_TRUNC('month', ...) where the contract declares an integer revenue_month — that stale note is exactly what a hand-tended CLAUDE.md becomes by month three. The same agent, the same day, with our...

We gave our agent the exact metric definition. It still wrote the wrong SQL

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

It's Not Just X. It's Y

Show HN: GoPeek – open links in live mini browser windows without new tabs