Deconstructing Datalog

In September 2022, after two rounds of revisions, I submitted the final version of my PhD dissertation, Deconstructing Datalog. Datalog is a logic programming language from the ’80s that augments relational algebra with recursive queries. It has both simple semantics and efficient implementation strategies. Like Lisp and the Velvet Underground, its influence exceeds its popularity; its ideas are still being absorbed into the mainstream.

While Datalog has a high power-to-weight ratio, it is fairly limited. For one thing, it has no functions or procedures: no way to abstract out repetitive code. As a fan of typed functional programming, I figured: how hard could it be? We have x (bottom-up logic programming); we have y (functional programming); what is x + y? Like a child playing with Legos, I set out to mash two cool things together to make a bigger, cooler thing: Datafun.

It worked!

However,

there were complications.

And if you want the whole story, you’ll just have to read it. It’s a reasonably short dissertation: 97 pages plus end-matter. But the primary results are that (i) this wacky idea works, and (ii) we can make it go reasonably fast, asymptotically speaking.

tl;dr, I’ll do my best to satisfy you here by describing what I see as the two central results, namely (1) that this crazy idea works at all and (2) that we can make it go reasonably fast, asymptotically speaking. -->

tl;dr of my dissertation. Instead, I aim to tease out unfinished business and unforeseen connections. !-- I hope at least one reader finds this useful. -->

Taking Datalog apart and putting it back together again

My dissertation’s central idea is that we can seamlessly integrate Datalog’s features into a typed functional language by working backward from their semantics. Take Datalog’s most distinctive feature, recursive queries; for instance, reachability in a graph:

reachable(start). reachable(Y) :- reachable(X), edge(X,Y).

This finds the smallest set reachable such that (1) start is reachable, and (2) if X is reachable and has an edge to Y, then Y is also reachable. We can rewrite these conditions as a single inequality on sets:

reachable ⊇ {start} ∪ {y : x ∈ reachable, (x,y) ∈ edge}

We can rewrite this to factor out the right-hand side:

reachable ⊇ f(reachable) where f(R) = {start} ∪ {y : x ∈ R, (x,y) ∈ edge}

This asks for a least prefix point: the least R that includes at least f(R) for some function f on relations. In fact, all recursive queries in Datalog fit this pattern. So to capture recursive queries in a functional language, we need (a) an expressive language for functions over relations, and (b) a least prefix point operator, fix. This turns our query into:

fix (λR. {start} ∪ {y | x ∈ R, (x,y) ∈ edge})

In a few short steps, we’ve turned predicates-and-logic into sets-and-functions.

Chapter 2 works out this recipe in detail. Most uniquely, to capture Datalog’s stratification condition, which ensures recursive queries are well-defined, Datafun needs to track monotonicity in its type system. This works compositionally: the composition of two maps is monotone if both maps are monotone; otherwise not. More technically, non-monotonicity can be represented in an otherwise monotone world by (in category theory) a monoidal comonad or (in type theory) a necessity modality or (in the compiler) carefully tracking which variables are safe to use non-monotonically. Like most type systems this is safe but incomplete — the type-checker rejects some monotone programs as non-monotone — but it handles everything Datalog can, and perhaps more.

How to find fixed points fast(er)

An important detail omitted above is fix’s implementation: how we find the least set R such that R ⊇ f(R). Take graph reachability, for instance:

f(R) = {start} ∪ {y | x ∈ R, (x,y) ∈ edge}

The naïve approach is to iterate f:

R₀ = ∅ = nothing R₁ = f(∅) = {start} R₂ = f(f(∅)) = nodes within 1 edge of start R₃ = f(f(f(∅))) = nodes within 2 edges of start ... Rᵢ = fⁱ(∅) = nodes within i-1 edges of start ...

until eventually Rᵢ = Rᵢ₊₁ because we have found the most distant reachable nodes.

This is an extremely inefficient breadth-first search, because it does redundant work. Consider the set-comprehension {y | x ∈ Rᵢ, (x,y) ∈ edges} needed to compute f(Rᵢ). This examines every node in Rᵢ; so in total we do at least as much work as the sum of the sizes of each Rᵢ. R_i, not their sum. --> Now observe that the sequence Rᵢ grows monotonically, R₀ ⊆ R₁ ⊆ R₂ ⊆ ..., because a node within 1 edge of start is also within 2 edges, et cetera. So if we reach a node in an early iteration, we wastefully re-examine it in every following iteration. On some graphs this produces quadratic blowup (e.g. let edge = {(i, i+1) | i ∈ [1..n]}), each node being examined on average a linear number of times!

The solution is to examine each node only once. Naïve iteration asks “what can I deduce in at most n steps?” for iteratively increasing n....

Deconstructing Datalog

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

It's Not Just X. It's Y

Show HN: GoPeek – open links in live mini browser windows without new tabs