How Semantic SQL Works - Cube BlogMay 21, 2026Data Engineering<br>Pavel Tiunov<br>PhD, Co-founder and CTO at Cube
Table of Contents
It's all about AI
Why not tables or SEMANTIC_LAYER.md?
Query language for the semantic layer
So what's Semantic SQL exactly?
Why E-Graphs?
What's next
Subscribe to Newsletter<br>Get Cube updates to your inbox for building better data products.
Subscribe
In a recent post, Artyom described why SQL is the right communication protocol for a standalone semantic layer and outlined the fundamental challenge: SQL evaluates bottom-up, but measures in a semantic layer need context from the outer query to aggregate correctly. He mentioned that we built a term rewrite system based on E-Graph theory to solve this. This post is the technical companion: how Semantic SQL works, why it needs E-Graphs, and what comes next.
It's all about AI
The idea of a semantic layer is not new. Business Objects introduced the concept in the early '90s with its "universe," a metadata mapping that sat between end users and the database. The user would pick "Revenue" from a dropdown, and the universe would translate that into the right SQL with the right joins, filters, and aggregation. SAP later acquired Business Objects and carried the same idea forward. MicroStrategy, Cognos, and practically every enterprise BI tool of that era had some variation of it: a governed metadata layer that decoupled business questions from physical schemas.
The reason the semantic layer was invented wasn't that people couldn't write SQL. It was that a lot of people writing SQL against the same database would get different data back. Two analysts computing "quarterly revenue" would write two subtly different queries (one includes refunds, the other doesn't; one filters on invoice date, the other on payment date) and produce two numbers that they'd each present with equal confidence. The semantic layer existed to make "quarterly revenue" mean exactly one thing, defined once, reused everywhere.
That problem went quiet for a while. The rise of cloud data warehouses in the 2010s centralized the storage, but pushed metric definitions out to individual BI tools, dbt models, or ad-hoc notebooks. Different teams ended up with different definitions again, just in different places.
Now, with AI agents writing SQL on behalf of users, the same problem is back, and arguably worse. When a human analyst writes an incorrect query, they can usually notice if the result looks off. An LLM can't. It will confidently produce a syntactically valid query that computes the wrong number and present it as an answer. The semantic layer was the solution to this problem thirty years ago, and it's the solution now: give the agent a governed set of metrics and dimensions to query against, rather than letting it guess from raw table schemas.
Why not tables or SEMANTIC_LAYER.md?
If the goal is just to tell an AI agent what metrics exist and how they're defined, there's a simpler-sounding approach: expose the semantic layer as a set of tables, or hand the agent a text file (a markdown doc, a YAML spec, a system prompt) that describes the available metrics, their definitions, and the tables they come from. The agent reads the description, generates SQL against the actual tables, and you're done.
This is what a lot of early AI-for-analytics prototypes actually do. It works for simple cases. But it breaks down in two distinct ways.
The evaluation problem. OLAP queries that involve measures (things like completed_percentage, revenue_per_user, or any ratio/percentage metric) cannot be correctly mapped to a flat table interface. The reason is fundamental: SQL evaluates bottom-up (inner subqueries first, then outer), while OLAP measures need top-down context to aggregate correctly. A measure like revenue_per_user isn't a column you can select from a table; it's a computation that depends on the grouping context of the query that references it. Exposing measures as table columns forces the agent to pick a single aggregation level, and that level is almost certainly wrong when the query contains subqueries or window functions. We'll go deeper on this in the next sections.
The guardrails problem. When you give an agent a text description and let it generate SQL directly against the warehouse, there are no structural guardrails, neither for correctness nor for security. The agent can hallucinate a join that doesn't exist. It can reference a column that the user shouldn't have access to. It can write a SELECT * that scans a 500M-row fact table. A text description is informational, not enforceable. The semantic layer, on the other hand, is both: it defines what's available and it enforces how it can be queried. Row-level security, column visibility, allowed aggregations: all of these are checked at query time, not by asking the agent to please respect them.
Query language for the semantic layer
If the semantic layer needs a query interface (not just a description),...