Stealing from Biologists to Compile Haskell Faster - Ian Duncan
Navigation
This started when someone mentioned, mostly in passing, that GHC has a flag for ApplicativeDo (-foptimal-applicative-do) that’s switched off by default because the algorithm behind it is too slow to use.
That sounded like a bug to me. An optimization that’s correct but disabled for being slow is the kind of thing you fix in an afternoon, I figured.
It wasn’t; it turned out to be a properly hard problem, and the problem has been eating at me for months.
ApplicativeDo is a quiet corner of GHC to start with. Most programs never switch it on, and most of the ones that do are fine with the default and<br>never reach for the optimal flag, so we’re well into the weeds even by compiler-internals standards. But the problem under the flag is a good one<br>, and chasing improvements to the algorithmic compexity for the optimal layout algorithm led somewhere I didn’t expect: the same dynamic program<br>biologists use to predict how a strand of RNA folds.
What is ApplicativeDo, and why is it cool?
Consider a canonical example of fetching the number of friends two people have in common from a remote social graph:
numCommonFriends x y = do<br>fx friendsOf x<br>fy friendsOf y<br>return (length (intersect fx fy))<br>The standard do desugaring turns this into sequential binds:
friendsOf x >>= \fx -><br>friendsOf y >>= \fy -><br>return (length (intersect fx fy))<br>This is strictly sequential. The second friendsOf call cannot start until the first one completes, because >>= passes the result of the left side to the right side. Even though fy does not depend on fx, >>= has no way to express that independence. We pay for two distinct network round-trips.
These two fetches can be written with Applicative:
(\fx fy -> length (intersect fx fy)) friendsOf x friendsOf y<br>The independence is now encoded in the types. has the type f (a -> b) -> f a -> f b, meaning the second argument cannot depend on the first. A monad like Haxl exploits this by batching computations combined with into a single round. Two friend-list lookups become one trip to the database.
Writing the version by hand is fragile. Adding a dependency between two statements requires refactoring back to >>=; removing one means you might leave performance on the table if you forget to switch to . ApplicativeDo allows developers to write plain do notation while the compiler determines where it can legally use .
The compiler’s task is to take a sequence of statements, determine the dependencies, and group independent statements under , falling back to >>= only when a dependency forces it.
The cost of optimal scheduling
Grouping independent statements isn’t always straightforward, and different groupings yield different performance characteristics. Consider this block:
do<br>aFriends friendsOf alice<br>aInfo profilesOf aFriends -- needs aFriends<br>bFriends friendsOf bob<br>bInfo profilesOf bFriends -- needs bFriends<br>score compatibility aFriends bInfo -- needs aFriends and bInfo<br>return (aInfo, bInfo, score)<br>Tracing the dependencies: aInfo needs aFriends; bInfo needs bFriends; score needs both aFriends and bInfo. The two friendsOf calls are entirely independent.
Because Haxl batches everything under a into one round, the metric we care about is the total number of rounds. Fewer rounds mean lower latency.
GHC’s default algorithm works greedily: it grabs the longest run of independent statements from the front, emits it, and repeats. Starting from the front, aFriends cannot be grouped with aInfo (which depends on it), so the first group is aFriends alone. This greedy decision pushes everything else a round later, resulting in four total rounds.
The optimal plan lines the two halves up side by side. The two friend-lookups share round 1, the two profile-lookups share round 2, and score waits for both in round 3. This saves a full round-trip.
The optimal algorithm finds this three-round schedule, but it is O(n³). In the original paper that introduced the feature (Desugaring Haskell’s do-notation Into Applicative Operations), a worst-case do block with 1,000 sequential statements took 55 seconds to compile using the optimal algorithm, compared to under two seconds for the greedy heuristic. Consequently, the optimal algorithm lives behind a flag that is rarely enabled.
The goal is to achieve the optimal answer without paying the O(n³) cost.
Reducing the problem
The statements and their contents are irrelevant; only the dependencies matter. We can model the statements as a sequence of nodes, with directed edges representing dependencies:
The compiler builds a tree over these statements without reordering them. Every node is either a sequential or parallel combination:
data Tree = Leaf Stmt | Seq Tree Tree | Par Tree Tree<br>Par is only legal when its two sides have no dependencies between them. The cost of a tree is the number of rounds it takes:
cost (Leaf _) = 1<br>cost (Seq l r) = cost l + cost r -- sequential: rounds add up<br>cost...