From Benchmarketing to Benchmaxxing: What 40 Years of Database Evals Can Teach Data Leaders about AI
Kostas Pardalis, Co-Founder
April 13, 2026
AI turned the whole tech industry into benchmarking addicts.
Benchmarking is nothing new to me. I've seen it used both as a sales and marketing tool and as part of the engineering process. But the scale of the obsession that AI has brought with it is on a completely different level.
I've been building data infrastructure for more than 10 years now and most recently I've been building agentic systems for data and platform engineers at Typedef. To do that reliably, I had to build my own internal eval system because nothing off the shelf could evaluate what we were building.
I've also seen benchmarketing 1 2 3 4, benchmarking that turned into vendor warfare, but nothing compares to what is happening today with AI. The noise is so bad that it's understandable that people have started losing faith in the performance claims they see out there.
People are not crazy, though. There is a good reason for all the interest in these benchmarks and evals, and it's not that different from why benchmarks have been around for so long in the database world.
There's a lot to learn from that. What I'll do here is walk you through the history of how the database industry dealt with the exact same problem, show you why the same patterns are repeating in AI at an even greater scale, and make the case for why you can and should build your own evals to do due diligence on the vendors you are dealing with.
By the end you will have a playbook for turning vendor benchmarks from their marketing tool into your due diligence tool!
The problem
There's pressure on you to deliver on the promises of AI through your team. It's 2026 and we've been promised a brand new world of unparalleled performance from incorporating AI into everything. As the person responsible for your data team's tooling, you're the one who has to figure out which of these promises are real.
It's not only pressure from above though, it's also pressure from peers. Look at all these product engineers churning out new front-end features in a tenth of the time it used to take, and doing it with these silly-looking terminal tools like Claude Code. How can we do the same as data practitioners?
We had projects to deliver agentic analytics that would put the data org in an amazing position once delivered. Imagine every business user being able to get an answer to any question they have on Slack, without having to open a ticket and wait a few engineering cycles for a dashboard.
But it seems that it's safe to hallucinate code, because there's a compiler and then an engineer to catch the issue before it hits production. You can't afford to hallucinate business metrics, though, and these models, when left unsupervised on a realistic data warehouse to run arbitrary SQL, do hallucinate a lot 5 6!
But wait, we can use semantic layers instead of raw SQL and this will solve the problem! And there's evidence that it helps significantly 7 8. In theory, yes. In practice, the agent will only be as good as your data modeling, which now lives on yet another layer of abstraction.
But regardless of the approach, the benchmarks and evals that vendors use to make their case were never designed for your workload.
So now we have something that was not designed for your needs being used to prove that a tool is going to work well for you. And guess what, it might convince you to buy, and then you will face the hard reality of the tool not delivering what was promised.
The issue is even more pronounced in data platforms and AI because there's a complete void there. The benchmarks just do not exist today, although this is slowly changing as we will see a bit later.
But even if the benchmarks were perfect, there's an important lesson from the decades of using benchmarks to evaluate database vendors.
There's no objective benchmark that can capture the needs of your organization and represent your workload and guess what, that holds true for AI tools too.
Sure, the latest frontier model can impress Dr. Knuth 9 by assisting with a proof, but the same model, when faced with your pipelines that broke because a Salesforce admin changed the currency of a column without letting you know, will end up in an existential crisis.
The problem is that today, someone will use the former to convince you to use LLMs to turn your team into 10x data engineers but will most probably avoid mentioning that the latter can also happen.
We can learn from the past though, and I'm here to show you how!
But first, let's do a small history lesson.
The benchmarketing wars of the database vendors
In the early to mid 1980s, the database market exploded from IBM mainframe dominance into a crowded field of competing vendors like Oracle, Sybase and others who all run on...