Evaluation order and nontermination in query languages
Last month I gave a talk at FLOPS about finite functional programming or ‘λFS’, my latest attempt (previously: Datafun) to combine functional programming with relational programming à la Datalog or SQL – and tensor algebra too, because why not?
In λFS, a relation R is treated as a function: we let R(x) = true iff x is in R, false otherwise. This function is finite in that R(x) = true for only finitely many inputs x: these are its support. The paper gives a type system for ensuring functions have finite support. At runtime, we represent a finite boolean function by tabulating its support in a hash table or a balanced tree (like a table in a database).
This generalizes beyond booleans to any codomain with a ‘default’ value.<br>A function’s support are those inputs whose output is non-default; for instance, the booleans with false as default, or the integers with 0 as default.<br>A finite function can then be represented as a key-value table whose keys are the function’s support; any other key implicitly maps to the default value.<br>Non-boolean finite maps arise naturally as the result of aggregations, while tensor algebra is essentially relational algebra over non-boolean (usually real-valued) finite maps.
Nontermination and evaluation order
In the talk, I mentioned that recursion was future work for λFS. Naturally, someone – I think it was Atsushi Igarashi – asked what the difficulty was. The primary difficulty is that recursion allows nontermination. Semantics for nontermination usually use domain theory, which has fallen out of fashion in recent decades, and which consequently I’ve never deeply studied and find a bit intimidating.1
Moreover, as soon as nontermination or any effect is involved, there is a question of evaluation order. Even in the untyped lambda calculus, where evaluation is confluent, call-by-name and call-by-value evaluation terminate on different programs. What’s worse, in relational languages ‘evaluation order’ has a more expansive meaning than in functional languages. Consider this example:
R, S, Q : nat => bool -- finitely supported; tables<br>test : nat -> bool -- not finitely supported; a procedure<br>Q(x) = R(x) and test(x) and S(x)
Here, => denotes a finite function, represented as a table, while -> denotes an arbitrary function, represented by a procedure or closure. Thinking relationally, our query Q is the intersection of the tables R and S, filtered by the black-box predicate test.
The central question is: in what order do we evaluate this query? We could go left-to-right: for each x in R, if test(x) passes, check if x is in S. In Python:
[x for x in R if test(x) if x in S]
Or, noting that test may be expensive but looking up an entry in a table is cheap, we could check if x is in S before calling test:
[x for x in R if x in S if test(x)]
And if we happen to know S is much smaller than R, we can save a lot of time by iterating over S instead of R:
[x for x in S if x in R if test(x)]
The standard relational philosophy regards evaluation order as an implementation detail, decided by the query planner. However, this ordering may affect whether Q terminates, because different orders call test on different arguments. A contrived example:
R = {"hello", "world"}<br>S = {"hello", "alice"}<br>test(x) = x == "hello" or loop-forever()<br>Q(x) = R(x) and test(x) and S(x)
Left-to-right execution calls test("hello"), which succeeds, followed by test("world"), which loops forever. But if we intersect R and S before calling test, we only test "hello" and so terminate.
I have expressed this problem in λFS, but it arises in any SQL or Datalog engine supporting user-defined functions, and exposes a tension between typical DB and PL assumptions:
DB The query engine may choose the execution strategy to optimize performance.
PL We can reason compositionally about program behavior – in particular, about termination or execution time.
λFS tries to bridge DB and PL, and gets stuck in the awkward middle: I must either favor one perspective over another, or find a way to thread the needle’s eye. I do not yet know how best to do this. Instead, here are three directions forward – three sketches of possible semantics for λFS with recursion (or Datalog with nonterminating functions).
1 Left to right evaluation has a simple cost model
The simplest strategy to implement also has the most straightforward semantics: evaluate from left to right.<br>This is the most typical PL approach: it avoids query planning and puts the job squarely in the hands of the programmer.<br>This makes the execution time of a query easy to predict and reason about, as my friend Rob Simmons pointed out to me while working on Finite Choice Logic Programming; this is more or less the cost model of Dusa, an implementation of FCLP. my current λFS prototype does – though I am not at all confident of its asymptotic performance! -->
In this approach, a conjunction of n terms turns into at...