Deconstructing Datalog
In September 2022, after two rounds of revisions, I submitted the<br>final version of my PhD dissertation, Deconstructing Datalog.<br>Datalog is a logic programming language from the ’80s that augments relational algebra<br>with recursive queries. It has both simple semantics and efficient implementation<br>strategies. Like Lisp and the Velvet Underground, its influence exceeds its popularity;<br>its ideas are still being absorbed into the mainstream.
While Datalog has a high power-to-weight ratio, it is fairly limited. For one<br>thing, it has no functions or procedures: no way to abstract out repetitive<br>code. As a fan of typed functional programming, I figured: how hard could it be?<br>We have x (bottom-up logic programming); we have y (functional programming);<br>what is x + y? Like a child playing with Legos, I set out to mash two cool<br>things together to make a bigger, cooler thing: Datafun.
It worked!
However,
there were complications.
And if you want the whole story, you’ll just have to read it. It’s a reasonably<br>short dissertation: 97 pages plus end-matter. But the primary results are that (i) this<br>wacky idea works, and (ii) we can make it go reasonably fast, asymptotically speaking.
tl;dr, I’ll do my best to satisfy you here by describing what I see as the two central results, namely (1) that this crazy idea works at all and (2) that we can make it go reasonably fast, asymptotically speaking. -->
tl;dr of my dissertation. Instead, I aim to tease out unfinished business and unforeseen connections.<br>!-- I hope at least one reader finds this useful. -->
Taking Datalog apart and putting it back together again
My dissertation’s central idea is that we can seamlessly integrate Datalog’s features into a typed functional language by working backward from their semantics. Take Datalog’s most distinctive feature, recursive queries; for instance, reachability in a graph:
reachable(start).<br>reachable(Y) :- reachable(X), edge(X,Y).
This finds the smallest set reachable such that (1) start is reachable, and (2) if X is reachable and has an edge to Y, then Y is also reachable.<br>We can rewrite these conditions as a single inequality on sets:
reachable ⊇ {start} ∪ {y : x ∈ reachable, (x,y) ∈ edge}
We can rewrite this to factor out the right-hand side:
reachable ⊇ f(reachable)<br>where f(R) = {start} ∪ {y : x ∈ R, (x,y) ∈ edge}
This asks for a least prefix point: the least R that includes at least f(R) for some function f on relations. In fact, all recursive queries in Datalog fit this pattern. So to capture recursive queries in a functional language, we need (a) an expressive language for functions over relations, and (b) a least prefix point operator, fix. This turns our query into:
fix (λR. {start} ∪ {y | x ∈ R, (x,y) ∈ edge})
In a few short steps, we’ve turned predicates-and-logic into sets-and-functions.
Chapter 2 works out this recipe in detail. Most uniquely, to capture Datalog’s stratification condition, which ensures recursive queries are well-defined, Datafun needs to track monotonicity in its type system.<br>This works compositionally: the composition of two maps is monotone if both maps are monotone; otherwise not.<br>More technically, non-monotonicity can be represented in an otherwise monotone world by<br>(in category theory) a monoidal comonad or<br>(in type theory) a necessity modality or<br>(in the compiler) carefully tracking which variables are safe to use non-monotonically.<br>Like most type systems this is safe but incomplete — the type-checker rejects some monotone programs as non-monotone — but it handles everything Datalog can, and perhaps more.
How to find fixed points fast(er)
An important detail omitted above is fix’s implementation: how we find the least set R such that R ⊇ f(R). Take graph reachability, for instance:
f(R) = {start} ∪ {y | x ∈ R, (x,y) ∈ edge}
The naïve approach is to iterate f:
R₀ = ∅ = nothing<br>R₁ = f(∅) = {start}<br>R₂ = f(f(∅)) = nodes within 1 edge of start<br>R₃ = f(f(f(∅))) = nodes within 2 edges of start<br>...<br>Rᵢ = fⁱ(∅) = nodes within i-1 edges of start<br>...
until eventually Rᵢ = Rᵢ₊₁ because we have found the most distant reachable nodes.
This is an extremely inefficient breadth-first search, because it does redundant work. Consider the set-comprehension {y | x ∈ Rᵢ, (x,y) ∈ edges} needed to compute f(Rᵢ). This examines every node in Rᵢ; so in total we do at least as much work as the sum of the sizes of each Rᵢ. R_i, not their sum. --> Now observe that the sequence Rᵢ grows monotonically, R₀ ⊆ R₁ ⊆ R₂ ⊆ ..., because a node within 1 edge of start is also within 2 edges, et cetera. So if we reach a node in an early iteration, we wastefully re-examine it in every following iteration. On some graphs this produces quadratic blowup (e.g. let edge = {(i, i+1) | i ∈ [1..n]}), each node being examined on average a linear number of times!
The solution is to examine each node only once. Naïve iteration asks “what can I deduce in at most n steps?” for iteratively increasing n....